Abstract

We present a near-optimal distributed algorithm for $(1 + o(1))$-approximation of single-commodity
maximum flow in undirected weighted networks that runs in $(D + \sqrt{n}) \cdot n^{o(1)}$ communication rounds
in the CONGEST model. Here, $n$ and $D$ denote the number of nodes and the network diameter,
respectively. This is the first improvement over the trivial bound of $O(n^2)$, and it nearly matches the
$\tilde{\Omega}(D + \sqrt{n})$ round complexity lower bound.

The development of the algorithm contains two results of independent interest: 

(i) A $(D + \sqrt{n}) \cdot n^{o(1)}$-round distributed construction of a spanning tree of average stretch $n^{o(1)}$.

(ii) A $(D + \sqrt{n}) \cdot n^{o(1)}$-round distributed construction of an $n^{o(1)}$-congestion approximator consisting
of the cuts induced by $O(\log n)$ virtual trees. The distributed representation of the cut approximator
allows for evaluation in $(D + \sqrt{n}) \cdot n^{o(1)}$ rounds.

All our algorithms make use of randomization and succeed with high probability. 


1 Introduction 

Computing a maximum flow is a fundamental task in network optimization. While the problem has a 
decades-old history rich with developments and improvements in the sequential setting, little is known in 
the distributed setting. In fact, prior to this work, the best known distributed time complexity in the standard 
CONGEST model remained at the trivial bound of $O(m)$, which is the time needed to collect the entire
topology and solve the problem locally. For undirected networks, this paper improves this unsatisfying state 
to near-optimality: 

Theorem 1.1. On undirected weighted graphs, a $(1 + \epsilon)$-approximation of a maximum s-t flow can be
computed in $(D + \sqrt{n}) \cdot n^{o(1)} \cdot \epsilon^{-3}$ rounds of the CONGEST model with high probability.

This round complexity almost matches the $\tilde{\Omega}(D + \sqrt{n})$ lower bound of Das Sarma et al. [14], which
holds for any non-trivial approximation. Before we proceed, let us formalize the model and the problem. 

1.1 Model and Problem 

Model. We use the standard CONGEST model of synchronous computation [24]. We are given a simple,
connected, weighted graph $G = (V, E, \mathrm{cap})$, where $\mathrm{cap} : E \to \mathbb{N}$, $\mathrm{cap}(e) \in \mathrm{poly}\, n$, are the edge
capacities.* By $D$, we denote the (hop) diameter of $G$. Each of the $n := |V|$ nodes hosts a processor with a
unique identifier of $O(\log n)$ bits, and over each of the $m := |E|$ edges $O(\log n)$ bits can be sent in each
synchronous round of communication; we assume that nodes have access to infinite strings of independent
unbiased random bits. We say that an event occurs with high probability (w.h.p.) if it happens with
probability $1 - n^{-c}$ for any desired constant $c > 0$ specified upfront.^ Initially, each node only knows its identifier,
its incident edges, and their capacities.

Problem. We fix an arbitrary orientation of the edges. In the following, we write $(u, v) \in E$ if $\{u, v\} \in E$
is directed from $u$ to $v$. An instance of the (single-commodity) max flow problem is given by, in addition to
specifying $G$, designating a source $s \in V$ and a sink $t \in V$. A (feasible) flow is a vector $f \in \mathbb{R}^m$ satisfying:

1. capacity constraints (edges): $\forall e \in E : |f_e| \le \mathrm{cap}(e)$;

2. preservation constraints (nodes): $\forall u \in V \setminus \{s, t\} : \sum_{(u,v) \in E} f_e - \sum_{(v,u) \in E} f_e = 0$; and

3. $\sum_{(s,u) \in E} f_e - \sum_{(u,s) \in E} f_e = -\sum_{(t,u) \in E} f_e + \sum_{(u,t) \in E} f_e = F$.

Here, $F$ is the value of $f$. A max flow is a flow of maximum value. For $\epsilon > 0$, a $(1 + \epsilon)$-approximate max
flow is a flow whose value is at most a factor $1 + \epsilon$ smaller than that of a max flow. In this work, we focus
on solving the problem of finding a $(1 + \epsilon)$-approximate max flow in the above model, where it suffices that
each node $u$ learns $f_e$ for its incident edges $\{u, v\} \in E$.
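
The three conditions above can be checked mechanically. The following is a minimal centralized sketch, not part of the distributed algorithm; the function name `flow_value`, the edge-list representation, and the tolerance `tol` are illustrative assumptions.

```python
from collections import defaultdict

def flow_value(edges, cap, f, s, t, tol=1e-9):
    """Return the value F of flow f, or raise ValueError if f is infeasible.

    edges: list of oriented pairs (u, v); cap and f are lists indexed like edges.
    (Hypothetical helper; names and representation are assumptions.)
    """
    net_out = defaultdict(float)  # outgoing minus incoming flow per node
    for (u, v), c, fe in zip(edges, cap, f):
        if abs(fe) > c + tol:  # condition 1: capacity constraint
            raise ValueError(f"capacity violated on edge {(u, v)}")
        net_out[u] += fe
        net_out[v] -= fe
    for node, x in net_out.items():  # condition 2: conservation away from s and t
        if node not in (s, t) and abs(x) > tol:
            raise ValueError(f"conservation violated at node {node}")
    return net_out[s]  # condition 3: net flow leaving the source

edges = [("s", "a"), ("a", "t"), ("s", "t")]
value = flow_value(edges, cap=[2, 1, 3], f=[1.0, 1.0, 3.0], s="s", t="t")
```

A negative entry of `f` simply means that flow crosses the edge against its fixed orientation.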

1.2 Related Work 

Network flow, being one of the canonical and most useful optimization problems, has been the target of
innumerable research efforts since the 1930s [28] (see, e.g., the classic book [2] and the recent survey [15]).
For the general, directed case, the fastest known sequential algorithm is by Goldberg and Rao and it solves
the max flow problem in time $O(m \cdot \min\{m^{1/2}, n^{2/3}\} \cdot \log(n^2/m) \log U)$, where $U$ is the largest
capacity. Particularly relevant from the point of view of the
present paper are recent efforts to obtain fast algorithms to compute (approximate) max flow solutions in
the undirected case. Using the graph sparsification technique of Benczur and Karger [11], any graph can be
partitioned into $k = \tilde{O}(m\epsilon^2/n)$ sparse graphs with $\tilde{O}(n/\epsilon^2)$ edges such that the max flow problem can be
approximately solved by combining max flow solutions for each of these sparse graphs. Using the algorithm
of Goldberg and Rao, this results in an algorithm with running time $\tilde{O}(m n^{1/2})$. In [13], Christiano et al.
improved this running time to $\tilde{O}(m n^{1/3})$ by applying the almost linear-time Laplacian solver of Spielman
and Teng [31] to iteratively minimize a softmax approximation of the edge congestions. Kelner et al. [16]
and Sherman [30] independently published two algorithms that compute a $(1 + \epsilon)$-approximation
to an undirected max flow problem in time almost linear in $m$. Finally, Peng [25] gave the first algorithm
with running time in $O(m\,\mathrm{polylog}(n))$.

However, to the dismay of many, and despite the fact that the word “network” even appears in the 
problem’s name, only little progress was made over the years from the standpoint of distributed algorithms. 
For example, Goldberg and Tarjan’s push-relabel algorithm, which is very local and simple to implement 
in the CONGEST model, requires $\Omega(n^2)$ rounds to converge, where $n$ is the number of nodes. This is
very disappointing, because in the CONGEST model, any problem whose input and output can be encoded
with $O(\log n)$ bits per edge can be trivially solved in $O(m)$ rounds, where $m$ is the number of edges, by
collecting all input at a single node, solving it there, and distributing the results back.

Early attempts focused, as customary in those days, on reducing the number of messages in asynchronous
executions. For example, Segall [29] gives an $O(nm^2)$-message, $O(n^2 m)$-time algorithm for
exact max flow, and Gafni and Marberg [21] give an algorithm whose message and time complexities are

* As merely an approximate flow is required, we can reduce the general case to this setting in $O((\sqrt{n} + D) \log C)$ rounds, where
$C$ is an upper bound on the ratio between the largest and smallest capacity.

^Taking the union bound over polynomially many events does not affect this property. We will use this fact frequently and 
implicitly throughout the paper. 





$O(n^2 m^{1/2})$. Awerbuch has attacked the problem repeatedly with the following results. In an early work [6]
he adapts Dinic’s centralized algorithm using a synchronizer, giving rise to an algorithm whose time and
message complexities are $O(n^3)$. With Leighton, in [9] they give an algorithm for solving multicommodity
flow approximately in $O(\ell m \log m)$ rounds, where $\ell \le n$ is the length of the longest flow path. Later he
considers the model where each flow path (variable) has an “agent” which can find the congestion of all
links on its path in constant time. In this model, he shows with Khandekar [7] how to approximate any
positive LP (max flow with given routes included) to within $(1 - \epsilon)$ in time polynomial in $\log(mnA_{\max}/\epsilon)$
(here $n$ is the number of variables, which is at least the number of paths considered). The same model is
used with Khandekar and Rao in [8], where they show how to approximate multicommodity flow to within
$(1 - \epsilon)$ in $O(\ell \log n)$ rounds. Using a straightforward implementation of this algorithm in the CONGEST
model results in an $O(n^2)$-time algorithm.

Thus, up to the current paper, there was no distributed implementation of a max-flow algorithm which always
requires a sub-quadratic number of rounds. Even an $O(n)$-time algorithm would have been considered
a significant improvement, even for the 0/1 capacity case.

1.3 Organization of this Article 

Our result builds heavily on a few major breakthroughs in the understanding of max flow in the centralized 
setting, most notably the almost linear-time approximation algorithm for the undirected max flow problem 
by Sherman [30], as well as a few other contributions. We first give an overview of the key concepts in 
Sections 2-4. We carefully revisit Sherman’s approach [30] and the main building blocks he relies on in 
Section 2. This sets the stage for shedding light on the challenges that must be overcome for its distributed 
implementation and presentation of our results in Section 3. There, we also provide a top-level view of 
the components of the algorithm, alongside pointers to the detailed proofs in Sections 5-9 showing that we 
can implement each of them by efficient distributed algorithms. In Section 4, we outline the distributed 
construction of an $n^{o(1)}$-congestion approximator, which is our key technical contribution; the role of a
congestion approximator is to estimate the congestion induced by optimally routing an arbitrary demand 
vector very quickly, which lies at the heart of the algorithm. All the details of our distributed algorithm and 
all the proofs appear in Sections 5-9. 

2 Overview of the Centralized Framework 

Sherman’s approach [30] is based on gradient descent (see, e.g., [22]) for congestion minimization with 
a clever dualization of the flow conservation constraints. The flow problem is re-formulated as a demand 
vector $b \in \mathbb{R}^n$ such that $\sum_i b_i = 0$. For the s-t flow problem, we have a positive $b_s$ and
negative $b_t$ with the same absolute value and the demand is zero everywhere else. The objective is to find a
flow $f^*$ that meets the given demand vector, i.e., the total excess flow in node $i$ is equal to $b_i$, and minimizes
the maximum edge congestion, which is the ratio of the flow over an edge to its capacity. Formally:

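minimize $\|C^{-1} f\|_\infty$ subject to $Bf = b$, (1)

where $C = (C_{ee'})_{e,e' \in E}$ is an $m \times m$ diagonal matrix with

\[ C_{ee'} = \begin{cases} \mathrm{cap}(e) & \text{if } e = e' \\ 0 & \text{else,} \end{cases} \]

and $B = (B_{ve})_{v \in V, e \in E}$ is an $n \times m$ matrix with

\[ B_{ve} = \begin{cases} 1 & \text{if } e = (u, v) \text{ for some } u \in V \\ -1 & \text{if } e = (v, u) \text{ for some } u \in V \\ 0 & \text{else.} \end{cases} \]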
Note that given a general (i.e., unconstrained) flow vector $f \in \mathbb{R}^m$, $(Bf)_v$ is exactly the excess flow at node
$v$. Hence, by the max-flow min-cut theorem, if we can solve problem (1), a simple binary search will find
an approximate max flow.
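
As a small illustration of the two quantities in problem (1), the sketch below builds the node-edge incidence matrix with the sign convention stated for B (+1 where an edge enters a node, -1 where it leaves) and evaluates the excess vector and the maximum edge congestion. The helper names and the tiny three-node instance are assumptions made for illustration only.

```python
def incidence_matrix(nodes, edges):
    """B[v][e] = +1 if edge e = (u, v) enters v, -1 if e = (v, u) leaves v, else 0."""
    index = {v: i for i, v in enumerate(nodes)}
    B = [[0] * len(edges) for _ in nodes]
    for e, (u, v) in enumerate(edges):
        B[index[u]][e] = -1
        B[index[v]][e] = 1
    return B

def excess(B, f):
    """Matrix-vector product B f: incoming minus outgoing flow at every node."""
    return [sum(b * x for b, x in zip(row, f)) for row in B]

def max_congestion(cap, f):
    """Infinity norm of C^{-1} f, i.e. the largest ratio |f_e| / cap(e)."""
    return max(abs(x) / c for x, c in zip(f, cap))

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
cap = [2, 1, 3]
f = [1.0, 1.0, 3.0]
B = incidence_matrix(nodes, edges)
b = excess(B, f)                      # node 0 sends 4 units, node 2 receives them
congestion = max_congestion(cap, f)   # two edges are fully saturated
```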

Instead of directly solving this constrained system, Sherman allows for general flows and adds a penalty 
term for any violation of flow constraints, i.e., 

minimize $\|C^{-1} f\|_\infty + 2\alpha \|R(b - Bf)\|_\infty$,

where $\alpha \ge 1$ and the matrix $R$ are chosen so that the optimum of this unconstrained optimization problem
does not violate the flow constraints. As we are interested in an approximate max flow, we can compute an
approximate solution and argue that the violation of the flow constraints will be small, too. Then one simply
re-routes the remaining flow in a trivial manner, e.g., on a spanning tree, to obtain a near-optimal solution.
Finally, to ensure that the objective function is differentiable (i.e., a gradient descent is actually possible),
$\|\cdot\|_\infty$ is replaced by the so-called soft-max.
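
The text does not spell out the soft-max formula. A common smooth surrogate for the infinity norm, assumed here for illustration, is the symmetric log-sum-exp $\log \sum_i (e^{x_i} + e^{-x_i})$, which exceeds $\|x\|_\infty$ by at most $\log(2k)$ for a vector with $k$ entries. The sketch below evaluates it and its gradient in a numerically stable way; the function names are hypothetical.

```python
import math

def soft_max(x):
    """Symmetric log-sum-exp: log(sum_i (e^{x_i} + e^{-x_i})), computed stably."""
    m = max(abs(v) for v in x)  # shift by the largest magnitude to avoid overflow
    total = sum(math.exp(v - m) + math.exp(-v - m) for v in x)
    return m + math.log(total)

def soft_max_gradient(x):
    """Gradient of soft_max: entry i is (e^{x_i} - e^{-x_i}) / sum_j (e^{x_j} + e^{-x_j})."""
    m = max(abs(v) for v in x)
    total = sum(math.exp(v - m) + math.exp(-v - m) for v in x)
    return [(math.exp(v - m) - math.exp(-v - m)) / total for v in x]

x = [3.0, -5.0, 0.5]
value = soft_max(x)
grad = soft_max_gradient(x)
```

The gradient puts most of its weight on the entries of largest magnitude, which matches the intuition that descent steps concentrate on the most congested edges and cuts.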

The Congestion Approximator R. The congestion of an edge $e$ (for a given flow $f$) is defined as the ratio
$|f_e|/\mathrm{cap}(e)$. When referring to the congestion of a cut in a given flow, we mean the ratio of the net
flow crossing the cut to the total capacity of the cut. Suppose for a moment that $\alpha = 1$ and $R$ contains one
row for each cut of the graph, chosen such that each entry of the vector $RBf$ equals the congestion of the
corresponding cut. In particular, $R$ would correctly reproduce the congestion of min cuts (which give rise
to maximal congestion). Moreover, the vector $Rb$ describes the inevitable congestion of the cuts for any
feasible flow. Thus, the components of $R(b - Bf)$ are the residual congestions to be dealt with to make $f$
feasible (neglecting possible cancellations). The max-flow min-cut theorem and the factor of 2 in the second
term of the objective function imply that it always improves the value of the objective function to route the 
demands arising from a violation of flow constraints optimally. Moreover, the gradient descent concentrates 
on the most congested edges and those that are contained in cuts with the top residual congestion. In 
particular, flow is pushed over the edges into the cut with the highest residual congestion to satisfy its 
demand until other cuts become more important in the second part of the objective. The first part of the 
objective impedes flow on edges the more they are congested (on an absolute scale and relative to others). 
Thus, approximately minimizing the objective function is equivalent to simultaneously approximating the 
minimum congestion and having small violation of flow constraints; solving up to polynomially small error 
and naively resolving the remaining violations then yields sufficiently accurate results. 

Unfortunately, trying to make $R$ capture congestion exactly is far too inefficient. Instead, one uses an
$\alpha$-congestion approximator, that is, a matrix $R$ such that for any demand vector $b$, it holds that

$\|Rb\|_\infty \le \mathrm{opt}(b) \le \alpha \|Rb\|_\infty$,

where $\mathrm{opt}(b)$ is the maximum congestion caused on any cut by optimally routing $b$. Since the second
term in the objective function is scaled up by factor $\alpha$, we are still guaranteed that optimally routing any
excess demands improves the objective function. However, this implies that the second term of the objective
function may dominate its gradient and thus emphasis is shifted towards feasibility rather than optimality. Sherman
proves that this slows down the gradient descent by at most a factor of $\alpha^2$, i.e., if $\alpha \in n^{o(1)}$, so is the number
of iterations of the gradient descent algorithm that need to be performed.

Congestion Approximators: Racke’s Construction. For any spanning tree T of G, deleting an edge 
partitions the nodes into two connected components and thus induces an (edge) cut of G. Note that on T, 
this cut contains only the single deleted edge, and in terms of congestion any cut of T is dominated by such 
an edge-induced cut: For any cut, the maximum congestion of an edge is at least the average congestion of 
the cut, and in T, there is a cut containing only this edge. 

These basic properties motivate the question of how well the cut structure of an arbitrary graph can be 
approximated by trees. Intuitively, the goal is to find a tree T (not necessarily a subgraph) spanning all nodes 
with edge weights such that routing any demand vector in $G$ and in $T$ results in roughly the same maximal
congestion. Because routing flows on trees is trivial, such a tree $T$ would give rise to an efficient congestion
approximator $R$: $R$ would consist of one row for each cut induced by an edge $(u, v)$ of $T$ with capacity $C$,
where the matrix entry corresponding to node $w$ is $1/C$ if $w$ is on $u$’s “side” of the cut and 0 otherwise;
multiplying a demand vector with the row then yields the flow that needs to pass through $(u, v)$ divided by
the capacity of the cut.
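
To make the row-times-demand computation concrete, the following sketch evaluates all rows of such a tree-based approximator at once: for every tree edge it sums the demands on the child side and divides by the edge capacity. It is a centralized illustration under assumed conventions (a rooted tree given by a `parent` map, capacities stored at the child endpoint); none of these names come from the paper.

```python
def tree_congestions(parent, cap, b):
    """Congestion of every tree edge {v, parent[v]} when routing demand b on the tree.

    parent: dict node -> parent (the root maps to None)
    cap:    dict node -> capacity of the edge from the node to its parent
    b:      dict node -> demand (missing nodes count as 0); demands sum to zero
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)
    order = [root]
    for v in order:                      # breadth-first order: parents before children
        order.extend(children[v])
    subtree = {v: b.get(v, 0) for v in parent}
    for v in reversed(order):            # leaves first: accumulate demand upwards
        if parent[v] is not None:
            subtree[parent[v]] += subtree[v]
    return {v: abs(subtree[v]) / cap[v] for v in parent if parent[v] is not None}

parent = {"r": None, "a": "r", "c": "a", "d": "r"}
cap = {"a": 2, "c": 1, "d": 4}
b = {"c": 1, "d": -1}
cong = tree_congestions(parent, cap, b)
```

The maximum over the returned values plays the role of $\|Rb\|_\infty$ for this single tree; the whole evaluation takes time linear in the number of nodes.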

In a surprising result [27], Racke showed that, using multiplicative weight updates (see, e.g., [5, 26, 32]),
one can construct a distribution of $\tilde{O}(m)$ trees so that (i) in each tree of the distribution, each cut has at least
the same capacity as in $G$ and (ii) given any cut of $G$ of total capacity $C$, sampling from the distribution
results in a tree $T$ where this cut has expected capacity $O(\alpha C)$; here $\alpha$ is the approximation ratio of a low
average stretch spanning tree algorithm that Racke’s construction uses as a subroutine. Note that this bound on the
expectation implies that for any cut of capacity $C$, there must be a tree in the distribution for which the cut
has capacity $O(\alpha C)$. Hence, the cuts given by all trees in the distribution give rise to an $O(\alpha)$-congestion
approximator $R$ with $\tilde{O}(mn)$ rows.

Low Average Stretch Spanning Trees. In order to perform Racke’s construction, one requires an efficient 
algorithm for computing low average stretch spanning trees. More precisely, given a graph $G = (V, E, \ell)$
with polynomially bounded lengths $\ell : E \to \mathbb{N}$, the goal is to construct a spanning tree $T$ of $G$ so that

\[ \sum_{\{u,v\} \in E} d_T(u, v) \le \alpha \sum_{\{u,v\} \in E} \ell(\{u, v\}), \]

where $d_T(u, v)$ is the sum of the lengths of the edges on the unique path from $u$ to $v$ in $T$ and $\alpha$ is the stretch factor.
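
The stretch condition can be evaluated directly on small instances. The sketch below computes the ratio of the summed tree distances to the summed edge lengths, i.e. the smallest $\alpha$ for which the inequality holds for a given tree. It is a naive quadratic-time illustration with hypothetical function names, not the construction used in the paper.

```python
def tree_distance(tree_adj, source):
    """Distances from source to all nodes along the (unique) tree paths."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in tree_adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def average_stretch(graph_edges, tree_edges):
    """graph_edges, tree_edges: lists of (u, v, length); tree_edges form a spanning tree."""
    tree_adj = {}
    for u, v, w in tree_edges:
        tree_adj.setdefault(u, []).append((v, w))
        tree_adj.setdefault(v, []).append((u, w))
    total_tree = sum(tree_distance(tree_adj, u)[v] for u, v, _ in graph_edges)
    total_len = sum(w for _, _, w in graph_edges)
    return total_tree / total_len

# A 4-cycle with unit lengths; the spanning tree drops the edge {3, 0}.
cycle = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
ratio = average_stretch(cycle, cycle[:3])
```

In the example, the three tree edges have stretch 1 and the dropped edge has stretch 3, so the ratio is 6/4.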

Sherman’s algorithm builds on a sophisticated low average stretch spanning tree algorithm that achieves
$\alpha \in O(\log n \cdot \mathrm{poly}(\log\log n))$ within $\tilde{O}(m)$ centralized steps [1]. We use a simpler approach providing
$\alpha \in 2^{O(\sqrt{\log n \log\log n})}$, which has been shown to parallelize well, i.e., has an efficient implementation in the
PRAM model [12].

Congestion Approximators: Madry’s Construction. Racke’s construction has the drawback that one
needs to sequentially compute a linear number of trees, which is prohibitively expensive from our point
of view. Madry generalized Racke’s approach to a construction that results in a distribution over $\tilde{O}(m/j)$
so-called $j$-trees [19], where $j$ is a parameter. A $j$-tree consists of a forest of $j$ connected components (trees)
and a core graph, which is an arbitrary connected graph with $j$ nodes: one from each tree (see Figure 1).

The properties of the distribution are the same as for Racke’s: sampling from the distribution preserves
cut capacities up to an expected $O(\alpha)$-factor, where $\alpha$ is the stretch of the utilized spanning tree
algorithm. Likewise, using all (dominant) cuts of all $j$-trees in the distribution to construct $R$ yields an
$O(\alpha)$-congestion approximator. Note that any cut in a $j$-tree is dominated by either a cut induced by an
edge of the forest, or by a cut of the core, in the following sense: Consider any demand vector and any
“mixed” cut. If there is an edge in the forest crossing the cut that has at least the same congestion as the
whole cut, then the cut induced by the forest edge dominates the mixed cut. Otherwise, we can remove all
forest edges from the mixed cut without reducing its congestion. As routing demands in the forest part of
the graph is trivial, Madry’s construction can be seen as an efficient reduction of the problem size.

Figure 1: A $j$-tree for $j = 5$. The core links are depicted in brown.

Congestion Approximators: Combining Cut Sparsifiers with Madry’s Construction. Using $j$-trees,
Sherman derives a suitable congestion approximator, i.e., one with $\alpha \in n^{o(1)}$ that can be constructed and
evaluated in $\tilde{O}(m + n^{1+o(1)})$ steps, as follows. First, a cut sparsifier is applied to $G$. A $(1 + \epsilon)$-sparsifier
computes a subgraph of $G$ with modified edge weights so that the capacities of all cuts are preserved up
to factor $1 + \epsilon$. It is known how to compute a $(1 + o(1))$-sparsifier with $\tilde{O}(n)$ edges in $\tilde{O}(m)$ steps using
randomization [11]. As the goal is merely to compute a congestion approximator with $\alpha \in n^{o(1)}$, the
multiplicative $1 + o(1)$ approximation error is negligible. Hence, this essentially breaks the problem of
computing a congestion approximator down to the same problem on sparse graphs.

Next, Sherman applies Madry’s construction with $j = n/\beta$, where $\beta = 2^{\sqrt{\log n}}$. This yields a distribution
of $\tilde{O}(\beta)$ many $n/\beta$-trees. The issue is now that the cores are arbitrary graphs, implying that it may be
difficult to evaluate congestion for cuts in the cores. However, the number of nodes in the core is $n' = n/\beta$.
Thus, recursion does the trick: apply the cut sparsifier to the core, use Madry’s construction on the resulting
graph (with $j' = n'/\beta = n/\beta^2$), rinse and repeat. In total, there are $\log_\beta n = \sqrt{\log n}$ levels of recursion
until the core becomes trivial, i.e., we arrive at a tree. For each level of recursion, the approximation ratio
deteriorates by a multiplicative $\alpha \in \mathrm{polylog}\, n$, where $\alpha$ is the stretch factor of the low-stretch spanning
tree algorithm, and a multiplicative $1 + o(1)$, for applying the cut sparsifier. This yields an $\alpha'$-congestion
approximator with

\[ \alpha' \in ((1 + o(1))\alpha)^{\sqrt{\log n}} \subseteq n^{o(1)}. \]

While the total number of constructed trees is still $\tilde{O}(\beta^{\log_\beta n}) = \tilde{O}(n)$, the number of nodes in a graph (i.e.,
a core from the previous level) on the $i$-th level of recursion is only $n/\beta^{i-1}$. The cut sparsifier ensures that
the number of edges in this graph is reduced to $\tilde{O}(n/\beta^{i-1})$ before recursing. Since the number of edges in
the core is (trivially) bounded by the number of edges of the graph in Madry’s construction, the total number
of sequential computation steps for computing the distribution is thus bounded by

\[ \tilde{O}(m) + \sum_{i=1}^{\log_\beta n} \tilde{O}(\beta^i \cdot n/\beta^{i-1}) \subseteq \tilde{O}(m + n^{1+o(1)}). \]

Step Complexity of the Flow Algorithm. The above recursive structure can also be exploited to evaluate
the $\alpha'$-congestion approximator Sherman uses in $n^{1+o(1)}$ steps. As mentioned earlier, the cuts of a $j$-tree
are dominated by those induced by edges of the forest and those which are crossed by core edges only
(cf. Figure 1). In the forest component, routing demands is unique, takes linear time in the number of nodes
(simply start at the leaves), and results in a modified demand vector at the core, on which the procedure recurses.

Sherman proves that his algorithm obtains a $(1 + \epsilon)$-approximate flow in $O(\epsilon^{-3} \alpha^2 \log^2 n)$ gradient
descent steps, provided $R$ is an $\alpha$-congestion approximator.^ It is straightforward to see (cf. Section 9.1)
that each of these steps requires $O(m)$ computational steps besides doing two matrix-vector multiplications
with $R$ and $R^\top$, respectively. Using the above observation and plugging in the time to construct the (implicit)
representation of $R$, one arrives at a total step complexity of $O(m n^{o(1)})$.

3 Distributed Algorithm: Contribution and Key Ideas 

The Distributed Toolchain. For a distributed implementation of Sherman’s approach, many subproblems 
need to be solved (sufficiently fast) in the CONGEST model. We summarize them in the following list, 
where stars indicate that these components are readily available from prior work. 

^Sherman points out that using Nesterov’s accelerated gradient descent method [23], this can be improved to $O(\epsilon^{-2} \alpha \log^2 n)$.
For both his and our results, this difference is insubstantial, as $\alpha \in n^{o(1)}$.

* Decomposing trees into $O(\sqrt{n})$ components of strong diameter $O(\sqrt{n})$, within $\tilde{O}(\sqrt{n} + D)$ rounds.
This can, e.g., be done by techniques pioneered by Kutten and Peleg for the purpose of minimum-weight
spanning tree construction [18].

* Constructing cut sparsifiers. Koutis [17] provides a solution that completes in polylog n rounds of the
CONGEST model. In Section 6, we give a simulation result for use in the recursive construction.

1. Constructing low average stretch spanning trees on multigraphs (Section 7). 

2. Applying Madry’s construction in the CONGEST model, even when recursing in the context of 
Sherman’s framework (Section 8). 

3. Sampling from the recursively constructed distribution (Section 8). 

4. Avoiding the use of the entire distribution for constructing the congestion approximator (see below). 

5. Performing a gradient descent step. This involves, e.g., matrix-vector multiplications with $R$, $R^\top$, and
$C^{-1}$, evaluation of the soft-max, etc. (Section 9).

We next present some additional description for the items 1 to 5 in the above list. 

1. Low Average Stretch Spanning Trees. In Section 7, we prove the following theorem. 

Theorem 3.1. Suppose $H$ is a multigraph obtained from $G$ by assigning arbitrary edge lengths in $[2^{n^{o(1)}}]$
to the edges of $G$ (known to incident nodes) and performing an arbitrary sequence of contractions. Then we
can compute a spanning tree of $H$ of expected stretch $2^{O(\sqrt{\log n \log\log n})}$ within $(\sqrt{n} + D) \cdot n^{o(1)}$ rounds.

To obtain this theorem, we translate a PRAM algorithm by Blelloch et al. [12] to the CONGEST 
model. The main issue when transitioning from the PRAM to the CONGEST model is that in the PRAM 
model, information about distant parts of the graph may be readily accessed. In the CONGEST model, we 
handle this by pipelining long-distance communication over a global breadth-first-search (BFS) tree of G;
communication over $O(\sqrt{n})$ hops is handled using the edges that have already been selected for inclusion
into the spanning tree and spanning trees of the contracted regions of G. 

2. Implementing Madry’s Scheme. This is technically the most challenging part. Also here, we have to 
overcome the difficulty of potentially needing to communicate a large amount of information over many 
hops; doing this naively results in too much contention and thus slow algorithms. We approach this by 
modifying Madry’s construction so that: 

• Instead of “aggregating” edges so that the core becomes a graph, we admit a multigraph as core. 

• We do not explicitly construct the core. Instead, we simulate both the sparsifier and the low average 
stretch spanning tree algorithm using the abstraction of cluster graphs (see Section 5). 

• In doing so, we maintain that every core edge is also a graph edge. This enables us to handle all
communication over this edge by using the corresponding graph edge.

• In this context, clusters are the forest components rooted at core nodes. We will maintain that forest
components have depth $\tilde{O}(\sqrt{n})$. While this is not strictly necessary, it simplifies the description of the
corresponding distributed algorithms, as the communication within each cluster can then be performed
via its (previously constructed) spanning tree.

• The cluster hierarchy that is established during the construction allows for a straightforward recursive
evaluation of the corresponding congestion approximator.

Section 8 gives the details of the construction.

3. Sampling from the Distribution. This is now straightforward, because for each sample, on each level of
the recursion we need to construct only $\tilde{O}(\beta)$ different $j$-trees for some $j$. This is also discussed in Section 8,
in which the formal version of the following theorem is proved.

Theorem 3.2 (Informal). Within $\tilde{O}((\sqrt{n} + D)\beta)$ rounds of the CONGEST model, we can sample a virtual
tree from the distribution used in Sherman’s framework, where $\tilde{O}(\beta)$ is the number of $j$-trees in the
distribution constructed when recursing on a core. The distributed representation allows us to evaluate the dominant
cuts of the tree when using it in a congestion approximator within $\tilde{O}(\sqrt{n} + D)$ rounds.
4. Avoiding the use of the entire distribution for constructing the congestion approximator. While
Sherman can afford to use all trees in the (recursively constructed) distribution, the above theorem is not
strong enough to allow for fast evaluation of all $\tilde{O}(n)$ trees. As Madry points out [19], it suffices to sample
and use $O(\log n)$ $j$-trees from the distribution he constructs to speed up any $\beta$-approximation algorithm
for an “undirected cut-based minimization problem”, at the expense of an increased approximation ratio of
$2\alpha\beta$, where $\alpha$ is the approximation ratio of the congestion approximator corresponding to the distribution
of $j$-trees. The reasoning is as follows:

• The number of cuts that need to be considered for such a problem is polynomially bounded. 

• The expected approximation ratio for any fixed cut when sampling from the distribution is $\alpha$. By
Markov’s bound, with probability at least 1/2 it is at most $2\alpha$.

• For $O(\log n)$ samples, the union bound shows that w.h.p. all relevant cuts are $2\alpha$-approximated.

• Applying a $\beta$-approximation algorithm relying on the samples only, which can be evaluated much
faster, results in a $2\alpha\beta$-approximation w.h.p.
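
The counting behind the second and third bullet can be written out as follows. This is a sketch under two assumptions that are introduced here only for illustration: the $k$ samples are drawn independently, and the number of relevant cuts is at most $n^{d}$ for some constant $d$.

\[
\Pr[\text{no sample } 2\alpha\text{-approximates a fixed cut}] \le 2^{-k},
\qquad
\Pr[\text{some relevant cut is not } 2\alpha\text{-approximated}] \le n^{d} \cdot 2^{-k} = n^{-c}
\quad \text{for } k = (c + d) \log_2 n .
\]

Hence $k \in O(\log n)$ samples suffice for any desired constant $c > 0$.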

Recall that the problem of approximating a max flow was translated to minimizing congestion for demands 
F and —F at s and t and performing binary search over F. The max-flow min-cut theorem implies the 
respective congestion to be the function of a single cut, which can be used to verify that the problem falls 
under Madry’s definition. 

Unfortunately, applying the sampling strategy as indicated by Madry is infeasible in Sherman’s framework.
As the goal is a $(1 + \epsilon)$-approximation, applying it to the above problem directly will yield a too
inaccurate approximation. Alternatively, we can apply it in the construction of a congestion approximator.
However, a congestion approximator must return a good approximation for any demand vector. There are
exponentially many such vectors even if we restrict $b \in \{-1, 0, 1\}^n$, and we are not aware of any result
showing that the number of min-cuts corresponding to the respective optimal flows is polynomially bounded.

We resolve this issue with the following simple, but essential insight, at the expense of squaring the 
approximation ratio of the resulting congestion approximator. 

Lemma 3.3. Suppose we are given a distribution of poly n trees so that given any cut of $G$ of capacity $C$,
sampling from the distribution results in a tree whose corresponding cut has at least capacity $C$ and at most
capacity $\alpha C$ in expectation. Then sampling $O(\log n)$ such trees and constructing a congestion approximator
from their single-edge induced cuts results in a $2\alpha^2$-congestion approximator of $G$ w.h.p.

Proof. Recall that a cut approximator estimates the maximum congestion when optimally routing an arbitrary
demand. Consider any demand vector and denote by $C$ the capacity of the corresponding cut that is
most congested when routing the demand. As sampling from the distribution yields approximation factor $\alpha$
in expectation, there must be some tree $T$ in the distribution whose corresponding cut has capacity at most
$\alpha C$. However, this means that when routing the demand via $T$, there is some edge in $T$ that experiences at
least $1/\alpha$ times the maximum congestion when routing the demand optimally in $G$. As the capacity of the
edge is at least that of the corresponding cut in $G$, it follows that the corresponding cut of $G$ has congestion
at least $1/\alpha$ of that of the min-cut when routing the demand.

As there are poly n trees, each of which has $n - 1$ edges, this shows that for any demand vector there
is one of polynomially many cuts of $G$ that experiences at least $1/\alpha$ times the maximum congestion when
optimally routing the demand vector. By Markov’s bound and the union bound, w.h.p. the congestion on
each of these cuts will be approximated up to another factor of $2\alpha$ when using $O(\log n)$ samples. □

5. Performing a gradient descent step. Most of the high-level operations required for executing a gradient
descent algorithm are straightforward to implement using direct communication between neighbors or
broadcast and convergecast operations on a BFS tree. The most involved part is multiplying the (implicitly
constructed) congestion approximator $R$ with an arbitrary demand vector $b$, and multiplying the transpose
of the approximator matrix, $R^\top$, with a given vector that specifies a cost for each edge of the trees.


Multiplying by $R$ is done by exploiting that routing on trees is trivial and using standard techniques: during
the construction, we already decomposed each tree into $O(\sqrt{n})$ components of strong diameter $O(\sqrt{n})$,
which can be used to solve partially by contracting components, making the resulting tree of $O(\sqrt{n})$ nodes
globally known, then determining modified demand vectors for the components out of the now locally
computable partial solution, and finally resolving these remaining demands within each component. Multiplication
with $R^\top$ is implemented using similar ideas. We refer to Section 9 for a detailed discussion of these
procedures. Plugging the building blocks outlined in this section into this machinery, we obtain our main result
Theorem 1.1.

4 Outline of Distributed Congestion Approximator Construction 

In this section, we outline how to adapt Madry’s construction to its recursive application in the distributed 
setting. In Section 8, we formally prove that we achieve the same guarantees as Madry’s distribution [19] in 
each recursive step and that our distributed implementation is fast. Here, we focus on presenting the main 
ideas of the required modifications to Madry’s scheme and its distributed implementation; to this end, it 
suffices to consider the construction of a single step of the recursion. 

Centralized Algorithm. As a starting point, let us summarize the main steps of one iteration of the
centralized construction. We state a slightly simplified variant of Madry's construction, which offers the
same worst-case performance and is a better starting point for what follows. From the previous step of
constructing the distribution, an edge length function ℓ_e is known (in the distributed setting, this
knowledge will be local). Given j < n − 1, the following construction yields an O(j)-tree.

1. Compute a spanning tree T of G of stretch α.

2. For each edge e = {v, w} ∈ E of the graph G, route cap(e) units of a commodity c_e from v to w
on (the unique path from v to w in) T.¹ Denote by f the vector of the sum of absolute flows passing
through the edges of T. Recall that max_{e∈E}{cap(e)} ∈ poly n and thus ‖f‖_∞ ∈ poly(n).

3. For e ∈ T, define the relative load of e as rload(e) := |f_e|/cap(e) ∈ poly n. We decompose the edge
set of T into O(log n) subsets F_i, i ∈ {1, ..., ⌈log(‖f‖_∞ + 1)⌉}, where e ∈ T is in F_i if
rload(e) ∈ (R/2^i, R/2^{i−1}] for R := max_{e∈T}{rload(e)}. As T has n − 1 > j edges, there must be
some F_i with Ω(j/log n) edges; let i_0 be minimal with this property. Define
F := {e ∈ T | rload(e) > R/2^{i_0−1}}. Note that |F| < j.

4. T \ F is a spanning forest of at most j + 1 components. Define H as the graph on node set V whose
edge set is the union of T \ F and all edges of G between different components of (V, T \ F).

5. For components C and C' of (V, T \ F), pick arbitrary v ∈ C and w ∈ C' and denote by p(C, C') ∈ C
the last node from C on the v-w path in T; note that p(C, C') does not depend on the choice of v and
w. Denote by P the set of such portals. Replace all edges between different components C, C' of
(V, T \ F) by parallel edges {p(C, C'), p(C', C)} (of the same weight).

6. In the resulting multigraph, iteratively delete nodes from V \ P of degree 1 until no such node remains.
Note that the leaves of the induced subtree of T must be in P, showing that the number of remaining
nodes in V \ P of degree larger than 2 is bounded by |P| − 1 < 2j. Add all such nodes to P.

7. For each path with endpoints in P and no inner nodes in P, delete an edge of minimum capacity and
replace it by an edge of the same capacity between its endpoints.

8. Re-add the nodes and edges of T \ F that have been deleted in Step 6.

9. For any p, q ∈ P, merge all parallel edges {p, q} into a single one whose capacity is the sum of the
individual capacities. The result is a j'-tree for j' = |P| < 4j.

¹ The difference to a single commodity is simply that flows in opposing directions do not cancel out. This
means that any given feasible (i.e., congestion-1) flow in G can be routed on T with at most the congestion
of this multi-commodity flow.

In his paper, Madry provides a scheme for updating the edge lengths between iterations so that this
construction results in a distribution on O(m/j) O(j)-trees that approximate cuts up to an expected
O(α)-factor, where α is the stretch of the spanning tree construction. Updating the edge length function
poses no challenges, so we will focus on the distributed implementation of the above steps in this section.
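
To make the bucketing in Step 3 of the centralized construction concrete, here is a small sequential
Python sketch. It groups tree edges into classes by relative load, each class covering a factor-2 range
below the maximum load R, picks the first class that is large enough, and returns all strictly heavier
edges. The function name `select_heavy_edges` and the explicit `threshold` parameter (standing in for the
Ω(j/log n) bound, whose constant is not fixed here) are assumptions for illustration only.

```python
def select_heavy_edges(rload, threshold):
    """Return (i0, F) for relative loads `rload` (dict: edge -> positive number).

    Class i holds the edges with rload in (R / 2**i, R / 2**(i - 1)], where R
    is the maximum relative load. i0 is the smallest class index with at
    least `threshold` edges, and F is the union of all classes below i0.
    Raises ValueError if a load is not positive or no class is large enough.
    """
    if any(r <= 0 for r in rload.values()):
        raise ValueError("relative loads must be positive")
    R = max(rload.values())

    classes = {}
    for e, r in rload.items():
        i = 1
        while r <= R / 2 ** i:  # find the smallest i with r > R / 2**i
            i += 1
        classes.setdefault(i, []).append(e)

    i0 = min(i for i, members in classes.items() if len(members) >= threshold)
    heavy = {e for i, members in classes.items() if i < i0 for e in members}
    return i0, heavy
```

With loads `{'a': 8, 'b': 3, 'c': 2.5, 'd': 2.2, 'e': 1}` and `threshold=3`, class 1 contains only `a`,
class 2 contains `b`, `c`, and `d`, so the sketch returns `i0 = 2` and `F = {'a'}`.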

Differences to the Centralized Algorithm. Before we come to the distributed algorithm, let us first discuss 
a few changes we make to the algorithm in centralized terms. These do not affect the reasoning underlying 
the scheme, but greatly simplify its distributed implementation. 

• We will omit the last step of the algorithm and instead operate on cores that are multigraphs. This 
changes the computed distribution, as we formally use a different graph as input to the recursion. 
However, Racke’s arguments (and Madry’s generalization) work equally well on multigraphs, as one 
can see by replacing each edge of the multigraph by a path of length 2, where both edges have the 
same capacity as the original edge. This recovers a graph of 2m edges from a multigraph of m edges 
without affecting the cut structure, and the resulting trees can be interpreted as trees on the multigraph 
by contraction of the previously expanded edges. Similarly, both the low average stretch spanning 
tree construction and the cut sparsifier work on multigraphs without modification. 

• After computing the spanning tree, we will immediately delete a subset of O(√n) edges to ensure that
the new clusters will have low-depth spanning trees. The deleted edges are replaced by all edges of G
crossing the corresponding cuts and will end up in the core. The same procedure is, in fact, applied to
all edges selected into F in Step 3 of the centralized routine; Madry's arguments show that removing
any subset of edges of T and replacing it this way can only improve the quality of cut approximation.
The main point of his analysis is that choosing F in the way he does guarantees that, in terms of
constructing the final distribution of j-trees, progress proportional to the number of edges in F_{i_0} is
made. We will apply the construction to cores of size n' ≫ √n, which implies that removing the
additional edges has asymptotically no effect on the progress guarantee.

• In the counterpart to Step 6 in Madry's routine, nodes from P may also be removed if their degree
becomes 1. Here, too, there is no asymptotic difference in the worst-case performance of our routine
compared to Madry's.

To simplify the presentation, in this section we will assume that all trees involved in the construction
have depth O(√n). This means that we can omit the deletion of O(√n) additional edges and further
related technicalities. The general case is handled by standard techniques for decomposing trees into O(√n)
components of depth O(√n) and relying on a BFS tree to communicate “summaries” of the components to
all nodes in the graph within O(√n + D) rounds (full details are given in Section 8). This approach was
first used for MST construction [18]; we use a simpler randomized variant (cf. Lemma 8.2).

Cluster Graphs. Recall that we will recursively call (a variant of) the above centralized procedure on the 
core. We need to simulate the algorithm on the core by communicating on G. To this end, we will use 
cluster graphs (see Section 5), in which G is decomposed into components that play the role of core nodes. 
We will maintain the following invariants during the recursive construction: 

1. There is a one-to-one correspondence between core nodes and clusters. 

2. Each cluster c has a rooted spanning tree of depth O(√n).

3. No other edges exist inside clusters. Contracting clusters yields the multigraph resulting from the 
above construction without Step 9. From now on, we will refer to this multigraph as the core. 

4. All edges in the (non-contracted) graph are also edges of G, and their endpoints know their lengths
from the previous step.

Overview of the Distributed Routine. We follow the same strategy as the centralized algorithm, with
the modifications discussed above. This implies that the core edges for the next recursive call will simply
be the graph edges between the newly constructed clusters. The following sketches the main steps of the
distributed implementation of our overall approach.

Figure 2: Illustration of the underlying idea of the aggregation scheme for the cut capacities. The cut
corresponding to the tree edge from cluster c to its parent has a total capacity given by all graph edges
leaving the subtree T_c. By labeling the endpoint of a graph edge by “+” if it leaves the subtree and by
“−” if it connects to a descendant, the cut capacity is thus the sum of all capacities of edges labeled “+”
minus all those of edges labeled “−” within T_c.

1. Compute a spanning tree T of stretch α of the core. This is done by the spanning tree algorithm of
Theorem 3.1, which can operate on the cluster graph.

2. For each edge e ∈ T, determine its absolute flow |f_e| (and thus rload(e) = |f_e|/cap(e)) as follows
(cf. Figure 2).

(*) For each cluster c, consider the cut induced by the edge to its parent. For each “side” of the cut,
we want to determine the total capacity of all edges incident to nodes of c that connect to the
respective side of the cut. Denote by c_+ the total “outgoing” capacity of cluster c towards the
root's side and by c_− the “incoming” capacity.

(a) Each cluster c learns its ancestor clusters in the spanning tree T of the core.

(b) Observe that for a cluster c, an edge contributes to c_− if and only if it connects to a node
within its subtree T_c. From the previous step, this information is known to one of the endpoints
of the edge. We communicate this and determine in each cluster c the values c_+ and c_− by
aggregation on its spanning tree.

(c) Suppose e ∈ T is the edge from cluster c to its parent. Using aggregation on the spanning tree T
of the core, we compute

$$|f_e| = \sum_{c' \in T_c} \left( c'_+ - c'_- \right).$$

3. Determine the index i_0 (as in Step 3 of the centralized routine). Given that rload(e) for each e ∈ T is
locally known, this is performed in O(D) rounds using binary search in combination with convergecasts
and broadcasts on a BFS tree. We set F := {e ∈ T | rload(e) > R/2^{i_0−1}}.

4. Define P as the set of clusters incident to edges in F. A simple broadcast on the cluster spanning
trees makes membership known to all nodes of each cluster c ∈ P.

5. Iteratively mark clusters c ∉ P with at most one unmarked neighboring cluster, until this process
stops. Add all unmarked clusters that retain more than 2 unmarked neighboring clusters to P.

6. For each path with endpoints in P whose inner nodes are unmarked clusters not in P, find the edge
e ∈ T \ F of minimal capacity and add it to F. This disconnects any two clusters c, c' ∈ P, c ≠ c',
in T \ F.

7. Each component of T \ F and the spanning trees of clusters induce a spanning tree of the
corresponding component of G. Each such component is a new cluster. Make the identifier of the unique
c ∈ P of each cluster known to its nodes and delete all edges between nodes in the cluster that are not
part of its spanning tree.

If all trees have depth O(√n), all the above steps can be completed in O(√n + D) rounds. Clearly, the first
three stated invariants are satisfied by the given construction. As mentioned earlier, it is also
straightforward to update the edge lengths, i.e., establish the fourth invariant. Once the distribution on the
current level of recursion is computed, one can hence sample and then move on to the next level.
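
The quantity |f_e| from Step 2 is the total capacity of all graph edges that leave the subtree below the
tree edge e. The sequential Python sketch below computes these values with a single bottom-up
aggregation. Note that it uses a different but equivalent bookkeeping from the “+”/“−” labeling of
Figure 2: every graph edge adds its capacity at both endpoints and subtracts twice its capacity at the
lowest common ancestor of its endpoints, so that edges inside a subtree cancel out. The function name
`cut_loads` and the input format are assumptions made for this example.

```python
def cut_loads(parent, edges):
    """Capacity of the cut induced by each tree edge.

    parent[v] is the parent of v in the rooted tree (None for the root);
    edges is a list of (u, v, cap) for all graph edges, tree edges included.
    Returns {v: total capacity of edges leaving the subtree of v}.
    """
    def ancestors(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path  # v, parent(v), ..., root

    balance = {v: 0 for v in parent}
    for u, v, cap in edges:
        above_u = set(ancestors(u))
        lca = next(a for a in ancestors(v) if a in above_u)
        balance[u] += cap
        balance[v] += cap
        balance[lca] -= 2 * cap  # cancels the edge for every subtree containing both ends

    # Aggregate from the deepest nodes upwards (a convergecast on the tree).
    for v in sorted(parent, key=lambda x: len(ancestors(x)), reverse=True):
        if parent[v] is not None:
            balance[parent[v]] += balance[v]

    return {v: balance[v] for v in parent if parent[v] is not None}
```

For the tree with root 0, children 1 and 2, and node 3 below node 1, together with unit-capacity tree
edges and one additional graph edge {3, 2} of capacity 5, every tree edge induces a cut of capacity 6.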

For the detailed description of the algorithm, the recursion, a formal statement of Theorem 3.2, and the 
respective proofs, we refer to Section 8. 

5 Cluster Graphs 

On several levels, our distributed congestion approximator construction is done in a hierarchical way. As a 
consequence many of the distributed computations used by our algorithm have to be run on a graph induced 
by clusters of the network graph. In order to be able to deal with such cluster graphs in a systematic way, we 
formally define cluster graphs and we describe how to simulate distributed computations on a cluster graph 
by running a distributed algorithm on the underlying network graph. 

Definition 5.1 (Distributed Cluster Graph). Given an n-node network graph G = (V, E), a distributed N-node
cluster graph Q = (𝒱, ℰ, ℒ, 𝒯, ψ) of size n is defined by a set of N clusters 𝒱 = {S_1, ..., S_N} partitioning
the vertex set V, a set (or multiset) of edges ℰ ⊆ (𝒱 choose 2), a set of cluster leaders ℒ, a set of cluster
trees 𝒯, as well as a function ψ that maps the edges ℰ of the cluster graph to edges in E. Formally, the tuple
(𝒱, ℰ, ℒ, 𝒯, ψ) has to satisfy the following conditions.

(I) The clusters 𝒱 = {S_1, ..., S_N} form a partition of the set of vertices V of the network graph, i.e.,
∀i ∈ [N]: S_i ⊆ V, ∀ 1 ≤ i < j ≤ N: S_i ∩ S_j = ∅, and ⋃_{i=1}^{N} S_i = V.

(II) For each cluster S_i, |S_i ∩ ℒ| = 1. Hence, each cluster has exactly one cluster leader ℓ_i ∈ ℒ ∩ S_i. The
ID of the node ℓ_i also serves as the ID of the cluster S_i and for the purpose of distributed computations,
we assume that all nodes v ∈ S_i know the cluster ID and the size n_i := |S_i| of their cluster S_i.

(III) Each cluster tree T_i = (S_i, E_i) is a rooted spanning tree of the subgraph G[S_i] of G induced by S_i.
The root of T_i is the cluster leader ℓ_i ∈ S_i ∩ ℒ. We assume that each node u ∈ S_i \ {ℓ_i} knows its
parent node v ∈ S_i in the tree T_i.

(IV) The function ψ: ℰ → E maps each edge {S_i, S_j} ∈ ℰ to an (actual) edge {v_i, v_j} ∈ E connecting
the clusters S_i and S_j, i.e., it holds that v_i ∈ S_i and v_j ∈ S_j. The two nodes v_i and v_j know that the
edge {v_i, v_j} is used to connect clusters S_i and S_j. If the cluster graph is weighted, the two nodes v_i
and v_j also know the weight of the edge {S_i, S_j}.

Note that (III) in particular implies that the subgraph of G induced by each cluster S_i is connected. When
dealing with a concrete distributed cluster graph Q, we use Q_𝒱, Q_ℰ, Q_ℒ, Q_𝒯, and Q_ψ to denote the
corresponding sets of clusters, edges, etc. Further, when only arguing about the cluster graph and not its
mapping to G, we only use the pair (𝒱, ℰ) to refer to it. In the following, we say that a cluster S ∈ 𝒱 knows
something if all nodes v ∈ S know it. That is, e.g., the last part of condition (II) says that every cluster
knows its ID and its size.
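
The four conditions of Definition 5.1 can be checked mechanically. The following Python sketch does so
for a centrally stored description of a cluster graph; it is meant only to make the definition tangible.
The function name `is_valid_cluster_graph`, the argument layout, and the restriction to at most one
cluster-graph edge per pair of clusters are assumptions of this example, and the knowledge requirements
of the definition (which node knows what) are not modeled.

```python
def is_valid_cluster_graph(V, E, clusters, leaders, parent, psi):
    """Check conditions (I)-(IV) of a distributed cluster graph.

    V: iterable of nodes; E: iterable of node pairs; clusters: list of sets;
    leaders: set of nodes; parent: dict mapping each non-leader node to its
    parent in its cluster tree; psi: dict mapping a pair of cluster indices
    (i, j) to the physical edge (v_i, v_j) that represents it.
    """
    E = {frozenset(e) for e in E}

    # (I) The clusters partition V.
    if sum(len(S) for S in clusters) != len(set(V)):
        return False
    if set().union(*clusters) != set(V):
        return False

    for S in clusters:
        # (II) Exactly one leader per cluster.
        if len(S & leaders) != 1:
            return False
        (leader,) = S & leaders
        # (III) Parent pointers stay inside S, use edges of G, and reach the leader.
        for u in S:
            seen, v = set(), u
            while v != leader:
                p = parent.get(v)
                if p is None or p not in S or frozenset((v, p)) not in E or v in seen:
                    return False
                seen.add(v)
                v = p

    # (IV) Every cluster-graph edge is represented by a physical edge between the clusters.
    for (i, j), (vi, vj) in psi.items():
        if vi not in clusters[i] or vj not in clusters[j] or frozenset((vi, vj)) not in E:
            return False
    return True
```

On the path 1-2-3-4 with clusters {1, 2} and {3, 4}, leaders 1 and 3, parent pointers 2→1 and 4→3, and
ψ mapping the single cluster edge to {2, 3}, the check succeeds; declaring both 1 and 2 as leaders
violates condition (II).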

We next define a weak version of the (synchronous) CONGEST model and we show that algorithms in 
this model can be efficiently simulated in distributed cluster graphs. 

Definition 5.2 (B-Bounded Space CONGEST Model). Let B = D(logn) be a given parameter. The B- 
Bounded Space CONGEST model is a computation model which restricts the CONGEST{B) model by 
requiring, for any d > 0, that each step of a node v in the B-Bounded Space CONGEST model can be emulated
in O(d) rounds by any tree T(v) of depth d, where the edges incident on v are incident on nodes of T(v) in
the emulation.

The definition of the B-Bounded Space model is directed toward emulation. The definition immediately
implies that if each node is emulated by a tree, then emulating a global step in time proportional to the
maximal tree depth (which could be Ω(n)) is trivial. However, the following lemma shows that this can
actually be done in time O(D + √n).

Lemma 5.1. Given an underlying n-node graph G = (V, E) and a cluster graph Q = (𝒱, ℰ, ℒ, 𝒯, ψ), a
t-round distributed algorithm A in Q in the B-Bounded Space CONGEST model can be simulated in the
(ordinary) CONGEST model in G with messages of size at most B in O((D + √n) · t) rounds, where D
is the diameter of G.

Proof Sketch. We assume that we are given a global BFS tree of G. If such a BFS tree is not available, it can
be computed in O(D) rounds in the CONGEST model. We simulate the algorithm A in a round-by-round
manner. Consider the end of the simulation of round r − 1 and assume that in each cluster S_i ∈ 𝒱, the leader
node ℓ_i knows the message M_i to be sent in round r. (For r = 1, we assume that this is true at the beginning
of the simulation.)

To start the simulation of round r, we first make sure that for every S_i ∈ 𝒱, every node v ∈ S_i knows the
message M_i to be sent to the neighbors in round r. In clusters S_i of size at most √n, this can be done in at
most √n rounds by broadcasting M_i on the spanning tree T_i of G[S_i]. For larger clusters, we use the global
BFS tree to disseminate the information. We first send all the messages M_i of clusters S_i of size larger than
√n to the root of the global BFS tree. Because the BFS tree has radius at most D and because there are at
most √n clusters of size larger than √n, this can be done in D + √n rounds (using pipelining). Now, in
another D + √n rounds, all these messages can be broadcast to all nodes of G (and thus also to the nodes
of the clusters that need to know them).

Now, for every two clusters S_i and S_j such that {S_i, S_j} ∈ ℰ, let {u_i, u_j} = ψ({S_i, S_j}) be the
physical edge connecting S_i and S_j. The node u_i ∈ S_i sends the message M_i to u_j and the node u_j ∈ S_j
sends M_j to u_i. This step can be done in a single round. Now, in each cluster S_i, each incoming message
of round r is known by one node in S_i and we need to aggregate these messages in order to compute the
outgoing message of each cluster. In clusters of size at most √n, this can again be done locally inside
the cluster (by Definition 5.2). Also, for the at most √n clusters of size larger than √n, we again use the
global BFS tree. Since in a tree of depth D, k independent convergecasts or broadcasts can be done in time
D + k, the messages of the large clusters can be computed and disseminated to the cluster leaders in time
O(D + √n). □
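
The pipelining argument used in the proof sketch can be observed in a small round-based simulation.
The Python sketch below forwards k messages towards the root of a tree, with every node sending at most
one message per round to its parent and always forwarding its smallest pending message first. The
function name `upcast_rounds`, the smallest-first rule, and the representation of messages as comparable
values are assumptions of this illustration; it models only the upcast half of the argument.

```python
def upcast_rounds(parent, initial):
    """Number of synchronous rounds until the root holds all messages.

    parent[v] is the parent of v (None for the root); initial maps nodes
    to the list of messages they hold at the start. Each tree edge carries
    at most one message per round.
    """
    queues = {v: sorted(initial.get(v, [])) for v in parent}
    root = next(v for v, p in parent.items() if p is None)
    total = sum(len(q) for q in queues.values())

    rounds = 0
    while len(queues[root]) < total:
        # All nodes pick their message first, then all deliveries happen.
        sends = [(parent[v], q.pop(0)) for v, q in queues.items() if v != root and q]
        for target, message in sends:
            queues[target].append(message)
            queues[target].sort()
        rounds += 1
    return rounds
```

On a path of depth D = 3 whose deepest node initially holds k = 4 messages, the simulation needs
D + k − 1 = 6 rounds: the first message arrives after 3 rounds and one further message arrives in each
subsequent round.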

6 Distributed Construction of Cut Sparsifiers 

Lemma 6.1. In a weighted N-node distributed cluster graph of size n, for any e > 0, it is possible to 
compute a spectral (1 -|- e)-sparsifier with 0(N ■ • log edges (w.h.p.) in the CONGEST model 

in time 0((D -I- y/n) ■ • log When the algorithm terminates, each of the edges of the sparsifier 

is directed such that the out-degree of each cluster is upper bounded by 0({e~^ ■ log and such that 

each cluster knows all its outgoing edges. 

Proof. We prove the lemma using the algorithm PARALLELSPARSIFY of Koutis [17], and then orient the
edges. Koutis's algorithm relies on the O(log n)-stretch spanner construction algorithm of Baswana and
Sen [10], which we henceforth refer to as BS. See Figure 3 for a description of the BS algorithm.

We start by showing how to emulate a step of node v in BS by a depth-d tree in O(d + log N) time
(w.h.p.). We may assume w.l.o.g. the existence of a root in each tree (because we can select one in O(d)
time); Step 2a is carried out by the root and the result is broadcast over the tree. For Step 2b, we note that
w.h.p., |Q_v| = O(log N) and hence making it known to all nodes of T(v) takes O(d + log N) time using
standard convergecast-broadcast. Step 3 is straightforward given that each node knows its cluster.

1. R_0 := {{v} | v ∈ V}.

2. For i := 1 to log N do:

(a) Mark each cluster of R_{i−1} independently with probability 1/2; let R_i := {S is marked | S ∈ R_{i−1}}.

(b) If v ∈ S for some S ∈ R_{i−1} \ R_i:

i. Define Q_v to be the set of edges that consists of the lightest edge from v to each of the clusters
in R_{i−1} that v is adjacent to.

ii. If v has no neighbor in a cluster in R_i, then v adds to the spanner all edges in Q_v.

iii. Otherwise, let u be the closest neighbor of v in a marked cluster. Then

• v joins the cluster of u (i.e., if u is in cluster S' ∈ R_i, then S' := S' ∪ {v}).

• v adds to the spanner the edge {v, u}, and also all edges {v, w} ∈ Q_v with W(v, w) <
W(v, u) (breaking ties by ID).

3. Each node v adds, for each cluster S ∈ R_{log N} it is adjacent to, the lightest edge connecting it to S.

Figure 3: The BS algorithm for O(log N) spanner construction given an N-node weighted graph G = (V, E, W).
The output is a subset of E.


Next, given a tree T(v) for each node, we assume that the depth of all trees is at least c log N for
some appropriate constant c > 0. If this assumption does not hold, we extend T(v) with a dummy path:
clearly, T(v) can emulate the extended tree without any slowdown. However, this extension may increase
the number of nodes by an O(log N) factor. Now, under this assumption and the emulation above, we may
apply Lemma 5.1 to conclude that BS can be executed in time O((D + √(N log N)) · log N).

Going back to the algorithm of Koutis [17], we note that it consists of (log invocations of BS, 

and some independent random selection and reweighting of edges. The former is discussed above, and the 
latter is trivial to emulate locally. 

Finally, for edge orientation, we give a simple algorithm that, given an N-node, D-diameter graph with
average degree d_av, orients all edges such that the out-degree of all nodes is O(d_av). The algorithm runs in
O(D + log n) steps in the space-bounded CONGEST model. The algorithm is as follows. First compute
the average degree in O(D) time, and then repeat the following procedure log n times at each node v:

• If the number of unoriented edges incident on v is less than 2d_av, then v orients all unoriented edges
outward, informs its neighbors, and halts.

The correctness of the procedure follows from the fact that throughout the execution, at most half the
non-halted nodes have degree larger than 2d_av. The lemma now follows from the fact that the graph generated
by Koutis' algorithm has average degree (log n / ε)^{O(1)}. □
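
The orientation procedure is easy to simulate sequentially. In the Python sketch below, each iteration
of the loop corresponds to one synchronous repetition of the rule above. The function name
`orient_edges`, the default number of repetitions ⌈log₂ N⌉, and the tie-breaking rule (when both endpoints
of an edge act in the same round, the node with the smaller identifier orients the edge) are assumptions
of this example; the sketch also returns any edges that are still unoriented at the end, rather than
assuming that none remain.

```python
import math


def orient_edges(nodes, edges, rounds=None):
    """Simulate the orientation rule on a simple graph with sortable node IDs.

    Returns (orientation, unoriented): a list of directed edges (tail, head)
    and the set of edges that were not oriented within the given rounds.
    """
    d_av = 2 * len(edges) / len(nodes)  # average degree
    if rounds is None:
        rounds = max(1, math.ceil(math.log2(len(nodes))))

    unoriented = {frozenset(e) for e in edges}
    halted, orientation = set(), []
    for _ in range(rounds):
        # Snapshot of the unoriented edges at the start of the round.
        incident = {v: [e for e in unoriented if v in e]
                    for v in nodes if v not in halted}
        acting = [v for v, inc in incident.items() if len(inc) < 2 * d_av]
        for v in sorted(acting):  # assumed tie-break: smaller ID wins
            for e in incident[v]:
                if e in unoriented:
                    (w,) = e - {v}
                    orientation.append((v, w))
                    unoriented.discard(e)
            halted.add(v)
    return orientation, unoriented
```

On a star with center 0 and five leaves, the average degree is 10/6, so every leaf (degree 1) orients its
edge towards the center in the first round, and the center halts in the second round without any
outgoing edge.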

7 Distributed Construction of Low Average-Stretch Spanning Trees 

Theorem 3.1. (restated and rephrased) Suppose H is a multigraph obtained from G by assigning arbitrary 
edge lengths in to the edges of G (known to incident nodes) and performing an arbitrary sequence 

of contractions. Then we can compute a rooted spanning tree of H of expected stretch
2^{O(√(log n · log log n))} within (√n + D) · n^{o(1)} rounds, where the edges of the tree in H and their
orientation are locally known to the endpoints of the corresponding edges in G.

Proof. We follow [12]: the high-level algorithm is by Alon et al. [3], which uses Algorithm Partition (of
[12]) for unweighted graphs. We describe the algorithm bottom-up. The main component in Algorithm
Partition is Algorithm SplitGraph, reproduced in Figure 4. The basic action of Algorithm SplitGraph is
growing BFS trees, an action in which emulating a single node by a tree is trivial. In SplitGraph we may
have contending BFS growths, but note that if two or more BFS traversals collide, only the winning ID
needs to proceed, and hence there are no collisions because no edge needs to carry more than a single BFS
traversal in each direction. Regarding the tree construction, we first note that the BFS growth naturally
creates a spanning tree for each cluster. Moreover, we can make all nodes know the complete path to the
root of their respective cluster in additional O(ρ) steps, by letting each node send its i-th ancestor to all
its children in round i. The running time of Algorithm SplitGraph is clearly O(ρ log N) in the bounded-
space CONGEST model, and therefore, using Lemma 5.1, we conclude that we can run SplitGraph in time
O(ρ log N (D + √N)) in the CONGEST model.

1. G^(1) := G; 𝒞 := ∅.

2. For t = 1 to 2 log N do:

(a) Let S^(t) be a random subset of the nodes V^(t) of G^(t), of the size prescribed in [12]; if V^(t) is
smaller than this size, S^(t) := V^(t).

(b) 𝒞 := 𝒞 ∪ {{s} | s ∈ S^(t)}.

(c) Each s ∈ S^(t) draws a random delay δ_s uniformly from [0, ⌊ρ/(2 log N)⌋].

(d) Each s ∈ S^(t) waits δ_s rounds and then initiates a BFS for ρ(1 − (t−1)/(2 log N)) − δ_s rounds in G^(t).

(e) A node covered by a BFS is added to cluster C_s, where s is the source of the first BFS to visit it,
breaking ties by ID.

(f) V^(t+1) := V^(t) \ {v | v ∈ C for some C ∈ 𝒞}; G^(t+1) := G^(t)[V^(t+1)].

Figure 4: Algorithm SplitGraph. The input is an unweighted graph G = (V, E) and a target radius ρ.

Algorithm SplitGraph is called by Algorithm Partition, whose input is an unweighted graph with an
arbitrary partition of the edges into K classes. Algorithm Partition applies Algorithm SplitGraph disregarding
classes, and then checks whether there exists a class where too many edges were split in different clusters.
If there is such an over-split class, the algorithm is restarted. We can implement each checking and restart
in the CONGEST model in O(D + K) time using a global BFS tree. Since the number of restarts is
bounded by O(log N) w.h.p. [12], and in our implementation we shall have K = O(√N), the overall time
for running Algorithm Partition is O(ρ log² N (D + √N)) in the CONGEST model.

The outermost algorithm is the one by Alon et al. [3], whose input is a weighted graph. The algorithm
first partitions the edges into O(√(log N)) classes by weight, where class E_i contains all edges whose weight
is in [z^{i−1}, z^i) for a certain value z. Then the algorithm proceeds in iterations until
the graph is a single node, where iteration j is as follows.

1. Call Algorithm Partition with edges E_1, ..., E_j and target radius ρ = z/4. Obtain clusters {C_i}.

2. Output a BFS tree for each cluster C_i.

3. Contract each resulting cluster C_i to a single node. Remove all self loops, but leave parallel edges in
place. The resulting multigraph, augmented with edge class E_{j+1}, is the input to iteration j + 1.

For the distributed implementation, note that edge contraction is trivial given that the endpoints know the
identity of the cluster they belong to, and that edge classification is purely local given z (which can be
communicated to all in O(D) time units). It can be shown [12] that w.h.p., the number of iterations is
0(log A/"v/log N log log N), and hence the running time of the algorithm is 0(plog A log*^^^^ N{D + 
VN)) = log a • 20 (WogiViogiogAr) because p = 0 ( 2 V 6 iogAr-iogiogtV)_ 

The claimed stretch follows from [12]. □ 


8 Distributed Construction of j-Trees 

From the distributed implementation point of view, the core technical challenge is to efficiently compute
a congestion approximator in a distributed way. As already pointed out, the congestion approximator is
constructed based on applying the j-tree construction of Madry [19] recursively. In the following, we review
Madry's construction and we show how to implement an adapted version of it in a distributed network. The
main objective of the construction is to approximate the flow structure of a given graph by a distribution of
graphs from a simpler class of graphs (i.e., j-trees). Formally, the similarity of the flow structure of two
graphs is captured by the following definition from [19].

Definition 8.1 (Graph Embeddability). [19] We are given β ≥ 1 and two (multi-)graphs G = (V, E, cap)
and G' = (V, E', cap') on the same set of nodes and with edge capacities cap(e) and cap'(e') for edges
e ∈ E and e' ∈ E'. We say that graph G is β-embeddable into G' if there exists a multicommodity
flow f = (f_e)_{e∈E} such that for every edge e ∈ E of G connecting nodes u and v, f_e is a flow on
(V, E', cap') that routes cap(e) units of flow between u and v, and for every edge e' ∈ E' of G', it holds
that |f(e')| := Σ_{e∈E} |f_e(e')| ≤ β · cap'(e').
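
As a small worked instance of Definition 8.1, the Python sketch below computes the congestion of one
particular embedding in which every edge e of G is routed along a single path in G'. This is only a
special case of the multicommodity flows allowed by the definition, so the returned value is an upper
bound on the smallest β for which G is β-embeddable into G'. The function name `embedding_congestion`
and the representation of edges as tuples are assumptions of this example.

```python
from collections import defaultdict


def embedding_congestion(cap_g, routes, cap_h):
    """Maximum relative load when routing each edge of G along a path in G'.

    cap_g: dict mapping each edge e of G to cap(e);
    routes: dict mapping each edge e of G to a list of edges of G' (its path);
    cap_h: dict mapping each edge e' of G' to cap'(e').
    """
    load = defaultdict(float)
    for e, path in routes.items():
        for e_prime in path:
            load[e_prime] += cap_g[e]  # cap(e) units of flow cross e'
    return max((load[e_prime] / cap_h[e_prime] for e_prime in load), default=0.0)
```

For a unit-capacity triangle on nodes a, b, c embedded into the path a-b-c with capacity 2 on both path
edges, routing the edge {a, c} over b puts a load of 2 on each path edge, so the congestion is 1 and the
triangle is 1-embeddable into this path.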

Intuitively, a graph G is β-embeddable into a graph G' if for every (multicommodity) flow problem, there
is a solution in G' such that the maximum relative congestion of all edges is at most a factor β larger than
for the optimal solution in G. As a generalization of the cut-based graph decompositions of Racke [27],
Madry defines the notion of an (α, 𝔾)-decomposition.

Definition 8.2 ((α, 𝔾)-Decomposition [19]). Given a (multi-)graph G = (V, E, cap) and a family 𝔾 of
graphs on the nodes V, an (α, 𝔾)-decomposition of G is a set of pairs {(λ_i, G_i)}_{i∈I} satisfying that:

• ∀i ∈ I: λ_i > 0;

• Σ_{i∈I} λ_i = 1;

• ∀i ∈ I: G_i = (V, E_i, cap_i) is a graph in 𝔾;

• ∀i ∈ I: G is 1-embeddable into G_i; and

• the graph defined by the convex combination² Σ_{i∈I} λ_i G_i is α-embeddable into G.

In words, {(λ_i, G_i)}_{i∈I} is a distribution on |I| graphs from 𝔾, into each of which G can be 1-embedded,
such that the distribution α-embeds into G.

Observe that such a decomposition can form the basis for a good congestion approximator: 1-embeddability
of G into each G_i guarantees that congestion is never overestimated, and the embeddability of the convex
combination ensures that, when sampling from the distribution, the expected factor by which we
underestimate congestion on a cut is at most α. Our goals are now to choose 𝔾 and the distribution such that

• α is small,

• we can construct the distribution efficiently, and

• we can efficiently evaluate the induced congestion when routing demand optimally on a graph from the
distribution.

The Plan. Let G = (V, E, cap) be a weighted (multi-)graph, 0 < j < |E| be an integer and let 𝕁 be the
family of j-trees over the node set V. In [19], it is shown that based on a protocol for computing spanning
trees with average stretch α, there exists an (α, 𝕁)-decomposition of G. This is shown in several steps. It is
first shown that a sparse (α, ℍ)-decomposition exists for a graph family ℍ which contains graphs that are
closer to the original graph G, and it is then shown that every graph H ∈ ℍ can be O(1)-embedded into a
j-tree and vice versa.

As described, we have to apply the j-tree construction recursively to the core graph. Each node in the
core graph is represented by a set of nodes (a cluster) in the network graph. On the network graph, the core
graph therefore corresponds to a graph between clusters of nodes. We therefore have to be able to apply
the j-tree construction on a cluster graph. As we will see, we can construct j-trees such that whenever two
nodes u and v of the core are connected by a (virtual) edge, there also is a physical edge between the two
trees (i.e., clusters of nodes) corresponding to u and v. Throughout our algorithm, we can therefore work
with a cluster multigraph such that a) the induced graph of each cluster is connected and b) for every edge
between two clusters c and c', there are nodes u ∈ c and v ∈ c' such that u and v are connected by an edge in
the underlying network graph. For doing distributed computations, we assume that each cluster has a leader
and that every node knows the ID of the leader and also its parent in a rooted spanning tree which is rooted at
the leader. In Section 5, we give a precise definition of a distributed cluster graph and we show that several
basic algorithms that we use as building blocks can be run efficiently in distributed cluster graphs.

² The sum of two weighted graphs G_1 = (V, E_1, cap_1) and G_2 = (V, E_2, cap_2) is defined as
G_1 + G_2 = (V, E_1 ∪ E_2, cap_12), where for each edge e ∈ E_1 ∪ E_2, cap_12(e) is defined as
cap_12(e) = cap_1(e) + cap_2(e) if e ∈ E_1 ∩ E_2 and cap_12(e) = cap_i(e) if e ∈ E_i \ E_{3−i} for i ∈ {1, 2}.
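
The convex combination used in Definition 8.2 builds on the sum of weighted graphs defined in its
footnote. The short Python sketch below computes such a combination on capacity dictionaries. It assumes
that scaling a graph by λ multiplies all of its capacities by λ, which the text does not state explicitly;
the function name `convex_combination` is likewise an assumption of this example.

```python
def convex_combination(weighted_graphs):
    """Capacities of the graph sum over i of lam_i * G_i.

    weighted_graphs is a list of (lam, cap) pairs, where cap maps edges to
    capacities and the coefficients lam are positive and sum to 1.
    """
    if abs(sum(lam for lam, _ in weighted_graphs) - 1) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    total = {}
    for lam, cap in weighted_graphs:
        for e, c in cap.items():
            # Edges present in several graphs add up; others keep their scaled capacity.
            total[e] = total.get(e, 0) + lam * c
    return total
```

Combining a graph with a single edge {a, b} of capacity 2 and a graph with edges {a, b} of capacity 4
and {b, c} of capacity 2, each with coefficient 1/2, gives capacity 3 on {a, b} and capacity 1 on {b, c}.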

In the following, we go through Madry’s j-tree construction step-by-step and describe how to adapt it 
so that we can implement it efficiently on a distributed cluster graph (i.e., in the CONGEST model in the 
underlying network graph). 

8.1 Low-Stretch Spanning Trees 

In the following, we consider the computation of the (α, 𝕁)-decomposition of some core graph. Assume that
the core graph is given as a distributed cluster graph Q = (𝒱, ℰ, cap), where for each edge e ∈ ℰ, cap(e)
is the capacity of e. As the time complexity of some of the steps for computing an (α, 𝕁)-decomposition
of Q depends on the number of edges of Q, as a first step, we sparsify Q. In Lemma 6.1, it is shown that in
O((D + √n) · polylog n) rounds, it is possible to compute a (1 + 1/polylog n)-spectral sparsifier of Q
with at most O(|𝒱| · polylog n) edges. Further, for each edge {c, c'} ∈ ℰ of the sparsifier, one of the nodes
of the edge manages the edge. As in general Q is a cluster graph, c and c' are clusters of physical nodes and
an edge connecting clusters c and c' is represented by a physical edge {u, v} for two (network) nodes u ∈ c
and v ∈ c'. We will maintain that every pair of nodes u ∈ c and v ∈ c' needs to represent at most one edge
between c and c' in Q. The two nodes u and v know about the edge between c and c' and its capacity.

In the following, we assume that Q is the graph after sparsification. If the number of nodes |𝒱| of Q is
less than √n, using a global BFS tree of the network graph, the whole structure of Q can be collected
in O(D + |𝒱| polylog n) = Õ(D + √n) rounds. In that case, we can therefore perform all remaining
operations locally at the nodes. Consequently, we will henceforth assume that |𝒱| ≥ √n.

During the construction of the $(\alpha, \mathcal{J})$-decomposition of $\mathcal{G}$, each edge $e \in \mathcal{E}$ is assigned a length $\ell(e)$. At
the beginning, $\ell(e)$ is proportional to $1/\mathrm{cap}(e)$, and before adding each $j$-tree, $\ell(e)$ is adapted for each edge.
As the first step of constructing each $j$-tree in the decomposition, Madry computes a spanning tree $\mathcal{T}$ of $\mathcal{G}$
for which it holds that

\[
\sum_{e=\{c,c'\}\in\mathcal{E}} d_{\mathcal{T}}(c, c') \cdot \mathrm{cap}(e)
= \sum_{e=\{c,c'\}\in\mathcal{E}} \mathrm{stretch}_{\mathcal{T}}(e) \cdot \ell(e) \cdot \mathrm{cap}(e)
\le \delta\alpha \cdot \sum_{e=\{c,c'\}\in\mathcal{E}} \ell(e) \cdot \mathrm{cap}(e) \qquad (2)
\]

for a sufficiently small positive constant $\delta$. In the above expression, $d_{\mathcal{T}}(u, v)$ denotes the sum of edge lengths
on the path between $u$ and $v$ on $\mathcal{T}$. Hence, $\mathcal{T}$ is a spanning tree with a bounded weighted average stretch.
Such a spanning tree can be computed by computing an (unweighted) low average stretch spanning tree
for a multigraph $\tilde{\mathcal{G}}$ which is obtained from $\mathcal{G}$ by (logically) replacing some of the edges of $\mathcal{G}$ with multiple
copies of the same edge (overall, the number of edges is at most doubled) [4, 19].
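The quantity bounded in Eq. (2) can be evaluated directly on a small instance. The following is a minimal centralized sketch, not part of the distributed algorithm; the function names and the edge-list encoding are illustrative assumptions, and the cluster graph is assumed to be simple (no parallel edges).

```python
from collections import defaultdict

def tree_path_length(tree_adj, src, dst):
    """Sum of edge lengths on the unique src-dst path in the tree (iterative DFS)."""
    stack = [(src, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nxt, ell in tree_adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + ell))
    raise ValueError("dst not reachable: tree_edges is not spanning")

def weighted_stretch_ratio(edges, tree_edges):
    """edges: list of (c, c2, length, cap); tree_edges: set of frozenset({c, c2}).

    Returns sum_e d_T(c, c2) * cap(e) divided by sum_e length(e) * cap(e),
    i.e. the factor that Eq. (2) requires to be at most delta * alpha.
    """
    tree_adj = defaultdict(list)
    for c, c2, ell, _ in edges:
        if frozenset((c, c2)) in tree_edges:
            tree_adj[c].append((c2, ell))
            tree_adj[c2].append((c, ell))
    lhs = sum(tree_path_length(tree_adj, c, c2) * cap for c, c2, _, cap in edges)
    rhs = sum(ell * cap for _, _, ell, cap in edges)
    return lhs / rhs
```

On a unit triangle with a two-edge path as the tree, the non-tree edge is stretched by a factor of 2 and the ratio is 4/3.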

In our distributed implementation of Madry’s $j$-tree construction, we adapt the parallel low average
stretch spanning tree algorithm from [12] to our setting. The algorithm of [12] already works in a mostly
decentralized fashion and we can therefore also apply it in a distributed setting. In Section 7, we show how
to run the algorithm of [12] on a distributed cluster graph. We note that the low average stretch spanning
tree algorithm of [12] directly tolerates multi-edges as described above, even if the same physical edge has
to be used to represent multiple edges between the same clusters.

Theorem 3.1. (restated and rephrased) Suppose $\mathcal{G}$ is a multigraph obtained from $G$ by assigning arbitrary
edge lengths in $2^{n^{o(1)}}$ to the edges of $G$ (known to incident nodes) and performing an arbitrary sequence
of contractions. Then we can compute a rooted spanning tree of $\mathcal{G}$ of expected stretch $2^{O(\sqrt{\log n \log\log n})}$
within $(\sqrt{n} + D)\, n^{o(1)}$ rounds, where the edges of the tree in $\mathcal{G}$ and their orientation are locally known to the
endpoints of the corresponding edges in $G$.
Given the spanning tree $\mathcal{T} = (\mathcal{V}_{\mathcal{T}}, \mathcal{E}_{\mathcal{T}})$ of $\mathcal{G}$, we need to compute capacities $\mathrm{cap}_{\mathcal{T}}(e)$ for the spanning
edges such that $\mathcal{G}$ is embeddable into $\mathcal{T}$, which essentially boils down to computing the absolute value
of the multicommodity flow $f'$ routing $\mathrm{cap}(e)$ units of flow on $\mathcal{T}$ for each $e \in \mathcal{E}$. As routing is trivial in
trees, $f'$ is unique. Once $|f'|$ is computed, the edge capacities of $\mathcal{T}$ can be chosen accordingly and it is
straightforward to pick a suitable $\lambda_j$ and update the length function $\ell(e)$. However, computing the absolute
value of the multicommodity flow fast in the distributed setting requires some work.

Computing the Multicommodity Flow. In the following, assume that the edges of $\mathcal{T}$ are oriented towards
the root, i.e., we will write $(c, \hat{c}) \in \mathcal{T}$ if $\hat{c} \in \mathcal{V}$ is the parent of cluster $c \in \mathcal{V}$. Denote by $\mathcal{T}_c$ the subtree of $\mathcal{T}$
rooted at $c$. When embedding $\mathcal{G}$ into $\mathcal{T}$, we have to route a total of

\[
|f'(c, \hat{c})| = \sum_{c_1 \in \mathcal{T}_c} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c}} \mathrm{cap}(\{c_1, c_2\}_{uv})
\]

commodity through the edge $(c, \hat{c}) \in \mathcal{T}$, where we indexed the different edges of the multigraph $(\mathcal{V}, \mathcal{E})$ by
using that for each $e \in \mathcal{E}$ between $c_1$ and $c_2$, $\psi$ maps $e$ to a unique edge $\{u, v\} \in E$ with $u \in c_1$ and $v \in c_2$.
Note that $u$ and $v$ know that $\{c_1, c_2\}_{uv} \in \mathcal{E}$, that they are in the clusters $c_1$ and $c_2$, respectively, and what
$\mathrm{cap}(\{c_1, c_2\}_{uv})$ is. We thus have to solve the task of determining this sum for each edge $(c, \hat{c}) \in \mathcal{T}$ via
computations on the graph $G$ underlying $(\mathcal{V}, \mathcal{E})$.

Observe that the spanning trees of the clusters together with $\mathcal{T}$ induce a (rooted) spanning tree $T$ of $G$.
Essentially, we would like to perform, for each edge $(c, \hat{c}) \in \mathcal{T}$, an aggregation on $T$ and pipeline these
aggregations to achieve good time complexity. However, as shown in the following lemma, the result is a
running time linear in the depth of the tree, which may be $\Theta(n)$ irrespective of $D$.

Lemma 8.1. If $T$ has depth $d$, for each edge $e = (c, \hat{c}) \in \mathcal{T}$, $c$ can determine $|f'(e)|$ within $O(d)$ rounds.

Proof. Consider the following algorithm.

1. For each cluster, all of its nodes learn the ancestors of the cluster in $\mathcal{T}$.

2. For each edge $\{c_1, c_2\}_{uv} \in \mathcal{E}$, $u$ and $v$ exchange the ancestor lists of $c_1$ and $c_2$.

3. Each node $u \in c_1 \in \mathcal{T}$ locally computes for each ancestor $c$ of $c_1$ the value

\[
\mathrm{cap}_c(u) := \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c \text{ is not ancestor of } c_2}} \mathrm{cap}(\{c_1, c_2\}_{uv}).
\]

4. For each edge $e = (c, \hat{c}) \in \mathcal{T}$, we aggregate $\sum_{u \in T_c} \mathrm{cap}_c(u)$ on $T_c$, where $T_c$ is the subtree of $T$
corresponding to $\mathcal{T}_c$.

Observe that, by definition,

\[
|f'(c, \hat{c})| = \sum_{c_1 \in \mathcal{T}_c} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c}} \mathrm{cap}(\{c_1, c_2\}_{uv}) = \sum_{u \in T_c} \mathrm{cap}_c(u),
\]

as each edge $\{c_1, c_2\}_{uv} \in \mathcal{E}$ with $c_1 \in \mathcal{T}_c$ and $c_2 \in \mathcal{T} \setminus \mathcal{T}_c$ satisfies that either $u \in T_c$ or $v \in T_c$. Hence, it
remains to show that the above routine can be implemented with a running time of $O(d)$.

Clearly, the first step takes $d$ rounds: $T$ has depth $d$, and we can perform concurrent floodings on
all subtrees without causing contention. The second step requires at most $d - 1$ rounds, as no node has
more than $d - 1$ ancestors. The third step requires local computations only. Finally, the fourth step can
be performed in $d$ rounds as well, since we can perform concurrent convergecasts on all subtrees without
causing contention. □
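The four steps of the proof can be mirrored by a short centralized sketch at the level of clusters. It is only an illustration under stated assumptions: the names `tree_edge_loads`, `parent`, and `edges` are hypothetical, physical nodes inside clusters are ignored, and Steps 3 and 4 are merged by charging each edge directly to the relevant ancestors. "Ancestor" is taken to include the cluster itself, so that the edge $(c_1, \hat{c}_1)$ is covered.

```python
def tree_edge_loads(parent, edges):
    """parent: dict cluster -> parent cluster (the root maps to None).
    edges: list of (c1, c2, cap) for all edges of the cluster multigraph.
    Returns dict c -> |f'(c, parent[c])| for every non-root cluster c.
    """
    def ancestors_or_self(c):
        chain = []
        while c is not None:
            chain.append(c)
            c = parent[c]
        return chain

    # Step 1: every cluster knows its ancestor list (including itself).
    anc = {c: set(ancestors_or_self(c)) for c in parent}
    load = {c: 0 for c in parent if parent[c] is not None}
    for c1, c2, cap in edges:
        # Steps 2-4: each endpoint charges the edge to every ancestor c of its
        # own cluster that is not an ancestor of the other endpoint's cluster,
        # which is exactly the case where the edge leaves the subtree of c.
        for a, b in ((c1, c2), (c2, c1)):
            for c in anc[a] - anc[b]:
                load[c] += cap
    return load
```

The root never appears in `anc[a] - anc[b]`, since it is an ancestor of every cluster, so only non-root clusters receive load.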
To handle the general case, i.e., $d \gg \sqrt{n}$, we first decompose $T$ into $O(\sqrt{n})$ parts of small diameter.
There are different ways to achieve such a decomposition of $T$ efficiently in a distributed way (e.g., by
using techniques from [18]). The easiest way is to use randomization. Suppose $\hat{c}$ is the parent cluster
of non-root cluster $c$. We sample edge $e = (c, \hat{c}) \in \mathcal{T}$ into the edge set $\mathcal{R}$ with independent probability
$q_e := \min\{1, |c|/\sqrt{n}\}$. Then, w.h.p., the forest $T \setminus \psi(\mathcal{R})$ consists of $O(\sqrt{n})$ trees of depth $\tilde{O}(\sqrt{n})$.

Lemma 8.2. Let $\mathcal{T}$ be a rooted spanning tree of a cluster (multi-)graph $\mathcal{G}$ and let $\mathcal{R}$ be a subset of the edges
chosen at random as described above, and assume that the spanning tree of each cluster has depth at most
$d$. W.h.p., the forest $T \setminus \psi(\mathcal{R})$ induced by the edges $\mathcal{T} \setminus \mathcal{R}$ and the cluster spanning trees consists of $O(\sqrt{n})$
rooted trees of depth $d + O(\sqrt{n} \log n)$.

Proof. Clearly, the number of trees in the forest induced by $\mathcal{E}_{\mathcal{T}} \setminus \mathcal{R}$ is equal to $|\mathcal{R}| + 1$. The expected
value for $|\mathcal{R}|$ is given by the sum of the probabilities; as the clusters are disjoint, $\sum_c |c|/\sqrt{n} \le \sqrt{n}$ and thus
$E[|\mathcal{R}|] \le \sqrt{n}$. A standard Chernoff bound
implies that $|\mathcal{R}|$ does not exceed $\sqrt{n}$ by more than a constant factor with high probability.

To bound the depth of each (rooted) tree in $T \setminus \psi(\mathcal{R})$, consider a cluster $c \in \mathcal{T}$ and a path $p$ from the
leader $\{r\} := \mathcal{L} \cap c$ to some node in the subtree $T_r$ of $T$ rooted at $r$. The depth of $T_r \cap c$ is bounded by $d$.
Denote by $E_p$ the set of edges of $p$ that correspond to edges in $\mathcal{T}$, i.e., each $e \in E_p$ satisfies that $e = \psi(e')$
for some $e' \in \mathcal{T}$. Denote by $c(e)$ the child cluster of $e$, i.e., the endpoint further away from the root of $T$. By
construction, $c(e) \in e'$ is also the cluster further away from the root of $\mathcal{T}$ (for the $e' \in \mathcal{T}$ with $\psi(e') = e$).
Therefore, the length of $p$ is bounded by

\[
d + |E_p| + \sum_{e \in E_p} |c(e)|,
\]

i.e., the sum of the number of its edges in $c$, the number of edges between clusters $|E_p|$, and the number of
edges in each traversed cluster. For $e \in E_p$, we have that $q_{\psi^{-1}(e)} = \min\{1, |c(e)|/\sqrt{n}\}$, yielding that

\[
E\big[|\mathcal{R} \cap \psi^{-1}(E_p)|\big] = \sum_{e \in E_p} \min\left\{1, \frac{|c(e)|}{\sqrt{n}}\right\}.
\]

Applying Chernoff’s bound shows that $E[|\mathcal{R} \cap \psi^{-1}(E_p)|] \in O(\log n)$ or $|\mathcal{R} \cap \psi^{-1}(E_p)| > 0$ w.h.p. The
former implies that (i) $\sum_{e \in E_p} |c(e)| \in O(\sqrt{n} \log n)$ and (ii) $|E_p| \in O(\sqrt{n} \log n)$, as always $|c(e)| \ge 1$; it
follows that the length of $p$ is bounded by $d + O(\sqrt{n} \log n)$ as claimed. The latter implies that, w.h.p., $p$
is not contained in $T \setminus \psi(\mathcal{R})$. As the number of simple paths in a tree is bounded by $O(n^2)$, applying the
union bound completes the proof. □
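The sampling rule behind Lemma 8.2 is easy to simulate centrally. The sketch below is an illustration only; the function names and the encoding of a tree edge $(c, \hat{c})$ by its child cluster `c` are assumptions made for this example.

```python
import math
import random

def sample_cut_edges(parent, size, n, rng):
    """Sample each tree edge (c, parent[c]) with probability min(1, |c| / sqrt(n)).

    parent: cluster -> parent cluster (root -> None); size: cluster -> number of
    physical nodes; rng: a random.Random instance. Returns the set of child
    clusters whose parent edge was sampled into R.
    """
    sqrt_n = math.sqrt(n)
    return {c for c, p in parent.items()
            if p is not None and rng.random() < min(1.0, size[c] / sqrt_n)}

def forest_components(parent, removed):
    """Label every cluster by the root of its tree in the forest T \\ R."""
    def root_of(c):
        # Climb towards the root until the edge to the parent has been removed.
        while parent[c] is not None and c not in removed:
            c = parent[c]
        return c
    return {c: root_of(c) for c in parent}
```

Removing $|\mathcal{R}|$ edges from a tree always leaves $|\mathcal{R}| + 1$ components, which is the counting step used at the start of the proof; clusters of size at least $\sqrt{n}$ are always cut off from their parent.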


For simplicity, in the following we assume that the high probability statements of the above lemma hold 
with certainty; the final statements then follow by applying the union bound. 

Throughout the construction, we will maintain that edges of $\mathcal{R}$ are never retained. As there will be
$o(\log n)$ levels of recursion, by inductive use of the above lemma, it follows that clusters always have
spanning trees of depth $\tilde{O}(\sqrt{n})$. Exploiting this property together with the small number of connected
components of $\mathcal{T} \setminus \mathcal{R}$, we obtain a fast routine for the general case.

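Lemma 8.3. Within $\tilde{O}(\sqrt{n} + D)$ rounds, for each edge $e = (c, \hat{c}) \in \mathcal{T}$, $c$ can determine $|f'(e)|$.

Proof. Denote by $\mathcal{C}$ the connected components of $\mathcal{T} \setminus \mathcal{R}$. For $c \in \mathcal{T}$, denote by $C \in \mathcal{C}$ its connected
component. We rewrite the (absolute value of the) multicommodity flow as

\[
|f'(c, \hat{c})| = \sum_{c_1 \in \mathcal{T}_c} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c}} \mathrm{cap}(\{c_1, c_2\}_{uv}) = |f'_1(c, \hat{c})| + |f'_2(c, \hat{c})| - |f'_3(c, \hat{c})|,
\]

where

\[
|f'_1(c, \hat{c})| := \sum_{c_1 \in \mathcal{T}_c \setminus C} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c \setminus C}} \mathrm{cap}(\{c_1, c_2\}_{uv}),
\]

\[
|f'_2(c, \hat{c})| := \sum_{c_1 \in \mathcal{T}_c \cap C} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c}} \mathrm{cap}(\{c_1, c_2\}_{uv}), \text{ and}
\]

\[
|f'_3(c, \hat{c})| := \sum_{c_1 \in \mathcal{T}_c \cap C} \; \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \in \mathcal{T}_c \setminus C}} \mathrm{cap}(\{c_1, c_2\}_{uv}).
\]

Note that $|f'_1(c, \hat{c})|$ depends only on the component of $c$, i.e., we need to determine and make
known only $|\mathcal{C}| \in O(\sqrt{n})$ values to cover this term. For the other terms, we will reduce the problem to an
aggregation on the spanning tree of $C$ in the vein of Lemma 8.1.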

Concerning $|f'_1(c, \hat{c})|$, we employ the following routine.

1. Using its spanning tree, each component $C \in \mathcal{C}$ determines a unique identifier (say, the smallest
cluster identifier) and makes it known to all its nodes.

2. For each $\{c_1, c_2\}_{uv} \in \mathcal{E}$, $u$ and $v$ exchange their component identifiers.

3. The list of component identifiers and edges $(C, \hat{C})$ for each $(c, \hat{c}) \in \mathcal{R}$ is made known to all nodes.
This enables each node to locally compute the tree resulting from contracting the components $C \in \mathcal{C}$
in $\mathcal{T}$.

4. For each $C \in \mathcal{C}$, fix an arbitrary $c \in C$. Each node $u \in \mathcal{T}_c \setminus C$ locally computes

\[
\mathrm{cap}_C(u) := \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ c_2 \notin \mathcal{T}_c \setminus C}} \mathrm{cap}(\{c_1, c_2\}_{uv});
\]

we set $\mathrm{cap}_C(u) := 0$ for all $u \notin \mathcal{T}_c \setminus C$ (nodes can determine whether they are in $\mathcal{T}_c \setminus C$ based on the
information collected in the previous two steps).

5. For each $C \in \mathcal{C}$, make $\sum_{u \in V} \mathrm{cap}_C(u)$ known to all nodes via a BFS tree of $G$. For all $c \in C$, we
have that $|f'_1(c, \hat{c})| = \sum_{u \in V} \mathrm{cap}_C(u)$.

As discussed earlier, components’ spanning trees have depth $\tilde{O}(\sqrt{n})$ and $|\mathcal{C}| \in O(\sqrt{n})$. Hence, Step 1 takes
$\tilde{O}(\sqrt{n})$ rounds and Steps 3 and 5 take $\tilde{O}(\sqrt{n} + D)$ rounds. Step 2 requires only one round of communication
and Step 4 is local. Overall, the routine requires $\tilde{O}(\sqrt{n} + D)$ rounds.

To determine $|f'_2(c, \hat{c})|$ and $|f'_3(c, \hat{c})|$ for each $c$, we proceed similarly to Lemma 8.1.

1. For each $C \in \mathcal{C}$ and each $c \in C$, all nodes in $c$ learn the list of ancestors of $c$ that are in $C$ (using the
spanning tree of $C$ in $G$).

2. For each $\{c_1, c_2\}_{uv} \in \mathcal{E}$, $u$ and $v$ exchange their component identifiers, as well as the ancestor lists
determined in the previous step.

3. The list of component identifiers and edges $(C, \hat{C})$ for each $(c, \hat{c}) \in \mathcal{R}$ is made known to all nodes.

4. For each $C \in \mathcal{C}$, $c \in C$, and $u \in T_c \cap C$, $u$ locally computes

\[
\mathrm{cap}_c(u) := \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ v \in c_2 \notin \mathcal{T}_c}} \mathrm{cap}(\{c_1, c_2\}_{uv}) \; - \sum_{\substack{\{c_1, c_2\}_{uv} \in \mathcal{E} \\ v \in c_2 \in \mathcal{T}_c \setminus C}} \mathrm{cap}(\{c_1, c_2\}_{uv}).
\]

(If $v \in C$, $u$ can decide whether $v \in \mathcal{T}_c$ based on the ancestor lists. If $v \notin C$, $u$ can decide whether $v \in \mathcal{T}_c$ based on $v$’s
component identifier and the information collected in Step 3.)

5. For each edge $e = (c, \hat{c}) \in \mathcal{T}$, we aggregate $\sum_{u \in T_c \cap C} \mathrm{cap}_c(u)$ on $T_c \cap C$, where $T_c$ is the subtree of
$T$ (the spanning tree of $G$) corresponding to $\mathcal{T}_c$.

Note that, by definition of $\mathrm{cap}_c(u)$, we have that

\[
\sum_{u \in T_c \cap C} \mathrm{cap}_c(u) = |f'_2(c, \hat{c})| - |f'_3(c, \hat{c})|.
\]

Hence, it remains to analyze the running time of this second subroutine. Again, using that components’
spanning trees have depth $\tilde{O}(\sqrt{n})$ and that $|\mathcal{C}| \in O(\sqrt{n})$, we can conclude that Steps 1, 2, and 5 take
$\tilde{O}(\sqrt{n})$ rounds, while Step 3 takes $\tilde{O}(\sqrt{n} + D)$ rounds. As Step 4 requires local computation only, the
resulting running time is $\tilde{O}(\sqrt{n} + D)$ rounds. Overall, we conclude that $|f'(c, \hat{c})|$ can be computed for each $c$
within $\tilde{O}(\sqrt{n} + D)$ rounds, by running each of the two subroutines and summing up their outputs. □
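The split of $|f'(c, \hat{c})|$ into three terms is an inclusion-exclusion over the two parts of the subtree, $\mathcal{T}_c \cap C$ and $\mathcal{T}_c \setminus C$. The brute-force sketch below checks the identity on a small instance; the helper names and the set-based encoding are assumptions for illustration, and nothing here reflects how the values are computed distributively.

```python
def crossing_capacity(edges, side_a, side_b):
    """Total capacity of edges with one endpoint in side_a and the other in side_b.

    The two sides are assumed to be disjoint, so no edge is counted twice.
    """
    total = 0
    for c1, c2, cap in edges:
        if (c1 in side_a and c2 in side_b) or (c2 in side_a and c1 in side_b):
            total += cap
    return total

def split_flow(edges, clusters, subtree, component):
    """subtree: clusters of T_c; component: clusters of the component C containing c.

    Returns (f1, f2, f3) as defined in the proof of Lemma 8.3.
    """
    inside = subtree & component   # T_c intersected with C
    below = subtree - component    # T_c without C
    f1 = crossing_capacity(edges, below, clusters - below)
    f2 = crossing_capacity(edges, inside, clusters - subtree)
    f3 = crossing_capacity(edges, inside, below)
    return f1, f2, f3
```

Edges between $\mathcal{T}_c \cap C$ and $\mathcal{T}_c \setminus C$ are counted once in the first term and removed again by the third, so `f1 + f2 - f3` equals the capacity leaving the subtree.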

8.2 Approximating $\mathcal{G}$ by a Distribution over Simpler Graphs

Using the techniques of [27] and the above construction of low average stretch spanning trees, it is possible
to design a distributed algorithm to compute a distribution of such spanning trees which approximates the cut
structure of the underlying network graph within a factor $2^{O(\sqrt{\log n \log\log n})}$ (i.e., in the order of the average
stretch of the computed spanning trees). However, when doing this, the number of spanning trees we need
to compute can be linear in the size of $\mathcal{G}$. To avoid this sequential bottleneck and achieve a small time
complexity in the distributed setting, we follow the same general idea as Sherman, who applied the
construction by Madry [19] recursively to decrease the step complexity.

For each edge $e \in \mathcal{T}$, we define $\mathrm{rload}_{\mathcal{T}}(e) := \mathrm{cap}_{\mathcal{T}}(e)/\mathrm{cap}(e) \ge 1$ to be the relative load of $e$
(edges $e \in \mathcal{E} \setminus \mathcal{T}$ have $\mathrm{rload}_{\mathcal{T}}(e) = 0$). The construction of [27] builds up a potential for each edge $e$
of $\mathcal{G}$, where with each new tree added to the distribution, the potential of $e$ grows by a term proportional
to $\mathrm{rload}_{\mathcal{T}}(e) / \max_{e' \in \mathcal{E}}\{\mathrm{rload}_{\mathcal{T}}(e')\}$. The potential of each edge is bounded by $\alpha = 2^{O(\sqrt{\log n \log\log n})}$ and hence with
every additional spanning tree, we are guaranteed to make progress for all edges $e \in \mathcal{E}$ with $\mathrm{rload}_{\mathcal{T}}(e)$ close
to $\max_{e' \in \mathcal{E}}\{\mathrm{rload}_{\mathcal{T}}(e')\}$. In the worst case, this can just be a single edge for each spanning tree $\mathcal{T}$. The key
idea of Madry [19] is to augment the tree $\mathcal{T}$ with additional edges in order to reduce the maximum relative
load so that in the new graph, a large number of edges have a relative load close to the maximum one.

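Basically, we can reach a large number of edges with relative load close to the maximum relative load
by repeatedly deleting the edge with largest relative load until a large number of the remaining edges has a
relative load that is within a constant factor of the remaining maximum relative load. When deleting some
edges of $\mathcal{T}$, one has to add back some of the original edges of $\mathcal{G}$ in order to maintain the property that $\mathcal{G}$
is embeddable into the resulting graph. Formally, let $\mathcal{F} \subseteq \mathcal{T}$ be a subset of the spanning tree edges. The
edge set $\mathcal{T} \setminus \mathcal{F}$ defines a spanning forest of $\mathcal{G}$ consisting of $|\mathcal{F}| + 1$ components. We define a subgraph
$\mathcal{H}(\mathcal{T}, \mathcal{F})$ of $\mathcal{G}$ as follows. The node set of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ is $\mathcal{V}$. Further, $\mathcal{H}(\mathcal{T}, \mathcal{F})$ contains all edges in $\mathcal{T} \setminus \mathcal{F}$ and
it contains all edges $\{c, c'\} \in \mathcal{E}$ of $\mathcal{G}$ for which $c$ and $c'$ are in different components in the forest induced
by the edges in $\mathcal{T} \setminus \mathcal{F}$. Let $\mathcal{E}_{\mathcal{H}}$ be the set of edges of $\mathcal{H}$. We set the capacities $\mathrm{cap}_{\mathcal{H}}(e)$ of edges $e \in \mathcal{E}_{\mathcal{H}}$ to
be $\mathrm{cap}_{\mathcal{H}}(e) := \mathrm{cap}_{\mathcal{T}}(e)$ if $e \in \mathcal{T} \setminus \mathcal{F}$ and $\mathrm{cap}_{\mathcal{H}}(e) := \mathrm{cap}(e)$ otherwise. Note that this guarantees that $\mathcal{G}$
is 1-embeddable into $\mathcal{H}$. For the following discussion, we define $\mathbb{H}[j]$ to be the set of graphs $\mathcal{H}(\mathcal{T}, \mathcal{F})$ for a
spanning tree $\mathcal{T}$ of $\mathcal{G}$ and a set of edges $\mathcal{F}$ of $\mathcal{T}$ of size $|\mathcal{F}| \le j$.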
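The definition of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ translates into a few lines of centralized code. This is a sketch under assumptions: the function name `build_H` and the dictionary encoding are hypothetical, and the cluster graph is taken to be simple so that an edge can be keyed by its pair of endpoints.

```python
def build_H(nodes, edges, tree_cap, removed):
    """edges: dict frozenset({c, c2}) -> cap(e) for all edges of the cluster graph.
    tree_cap: dict tree edge -> cap_T(e); removed: the subset F of tree edges.
    Returns dict edge -> cap_H(e) describing H(T, F).
    """
    comp = {c: c for c in nodes}

    def find(c):
        # Union-find lookup with path halving.
        while comp[c] != c:
            comp[c] = comp[comp[c]]
            c = comp[c]
        return c

    # Components of the forest T \ F.
    kept = [e for e in tree_cap if e not in removed]
    for e in kept:
        a, b = tuple(e)
        comp[find(a)] = find(b)

    # Kept tree edges carry cap_T; edges between different components carry cap.
    cap_H = {e: tree_cap[e] for e in kept}
    for e, cap in edges.items():
        a, b = tuple(e)
        if find(a) != find(b):
            cap_H[e] = cap
    return cap_H
```

Note that the removed tree edges themselves connect different components of the forest, so they reappear in $\mathcal{H}$ with their original capacity $\mathrm{cap}(e)$.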

Assume that the weighted average stretch of the spanning tree $\mathcal{T}$ as given by Eq. (2) is upper bounded
by $\alpha$. Also recall that we assume that all capacities of $G$ are integers that are polynomially bounded in the
number of nodes $n$. As throughout our construction, each edge capacity always approximately corresponds
to the capacity of some cut in $G$, it is not hard to guarantee that all capacities of $\mathcal{G}$ are integers between 1 and
$\mathrm{poly}(n)$. Given a spanning tree $\mathcal{T}$ of $\mathcal{G}$, let $R := \max_{e \in \mathcal{T}}\{\mathrm{rload}_{\mathcal{T}}(e)\}$ be the largest relative load of all edges
of $\mathcal{T}$. In order to determine the set of edges $\mathcal{F}$, we start by partitioning the edges in $\mathcal{T}$ into $i_{\max} = O(\log n)$
classes $\mathcal{F}_1, \ldots, \mathcal{F}_{i_{\max}}$, where class $\mathcal{F}_i$ contains all edges with relative load in $(R/2^i, R/2^{i-1}]$. Now, for any
$j_0 < |\mathcal{T}|$, there exists an edge class $\mathcal{F}_i$ such that $|\bigcup_{i' < i} \mathcal{F}_{i'}| \le j_0$ and $|\mathcal{F}_i| \ge j_0/i_{\max} = \Omega(j_0/\log n)$;
otherwise, $|\mathcal{T}| = |\bigcup_i \mathcal{F}_i| \le j_0 < |\mathcal{T}|$, a contradiction. We define $\mathcal{F}' := \bigcup_{i' < i} \mathcal{F}_{i'}$.
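The choice of the class $\mathcal{F}_i$ can be illustrated by a short centralized sketch. The function name `choose_class` is hypothetical, and `i_max` is taken here to be the largest non-empty class index rather than a fixed $O(\log n)$ bound, which is an assumption made to keep the example self-contained.

```python
def choose_class(rload, j0):
    """rload: dict tree edge -> relative load (each at least 1).

    Returns (i, F_prime, F_i): the first class index i whose class holds at
    least j0 / i_max edges, the edges of all heavier classes, and class i.
    """
    R = max(rload.values())
    classes = {}
    for e, r in rload.items():
        # Class i holds the loads in the interval (R / 2**i, R / 2**(i - 1)].
        i = 1
        while r <= R / 2 ** i:
            i += 1
        classes.setdefault(i, []).append(e)
    i_max = max(classes)
    prefix = []
    for i in range(1, i_max + 1):
        members = classes.get(i, [])
        if len(members) >= j0 / i_max:
            return i, prefix, members
        # Every skipped class has fewer than j0 / i_max edges, so the prefix
        # stays below j0 in total.
        prefix = prefix + members
    raise ValueError("no class found; requires j0 < number of tree edges")
```

Scanning from the heaviest class downwards and stopping at the first sufficiently large class keeps $|\mathcal{F}'|$ below $j_0$, mirroring the counting argument in the text.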

In [20], this set of edges is used to construct the graph $\mathcal{H}(\mathcal{T}, \mathcal{F}')$. For the distributed computation,
it will be useful to have a graph $\mathcal{H}(\mathcal{T}, \mathcal{F})$ in which all the trees of the forest induced by $\mathcal{T} \setminus \mathcal{F}$ have
small diameter. We therefore rely on the same technique as for computing the capacities of $\mathcal{T}$ and remove
a few random additional edges of $\mathcal{T}$. In fact, we can simply use the same subset of edges $\mathcal{R} \subseteq \mathcal{T}$ that
has been determined and used before, prior to Lemma 8.2. We define $\mathcal{F} := \mathcal{F}' \cup \mathcal{R}$ and use the graph
$\mathcal{H} = (\mathcal{V}, \mathcal{E}_{\mathcal{H}}, \mathrm{cap}_{\mathcal{H}}) := \mathcal{H}(\mathcal{T}, \mathcal{F})$. Since all the edges of $\mathcal{T}$ with $\mathrm{rload}_{\mathcal{T}}(e) > R/2^{i-1}$ are removed, all
edges of $\mathcal{H}$ have relative load at most $R/2^{i-1}$. Further, all the $\Omega(j/\log n)$ edges of $\mathcal{F}_i$ have relative load
larger than $R/2^i$. Based on Theorem 5.2 and Corollary 5.6 of [20] and on Theorem 3.1, we can show the
following lemma.

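Lemma 8.4. Given are a distributed cluster (multi-)graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathrm{cap}_{\mathcal{G}})$ consisting of $|\mathcal{V}| = N$ clusters
and $|\mathcal{E}| = N\, \mathrm{polylog}(n)$ edges and a parameter $j \ge 1$ such that $j = \omega(\sqrt{n} \log n)$. There is a distributed
algorithm to compute a $(2^{O(\sqrt{\log N \log\log N})}, \mathbb{H}[j])$-distribution of $\mathcal{G}$ on $2^{O(\sqrt{\log N \log\log N})} \cdot N/j$ graphs,
which runs in the CONGEST model on the underlying network graph $G$ in $(D + \sqrt{n}) \cdot n^{o(1)} \cdot N/j$ rounds.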

Proof. Let $\alpha = 2^{O(\sqrt{\log N \log\log N})}$ be the average stretch guarantee of the spanning tree algorithm. It follows
directly from Theorem 5.2 and Lemma 5.5 in [20] that we can compute an $(O(\alpha), \mathbb{H}[|\mathcal{E}| \cdot \alpha \log(n)/s])$-
decomposition of $\mathcal{G}$ on $s$ graphs in time $s \cdot T_{\mathrm{tree}}$ if the following conditions are satisfied:

(1) The time for computing one low average stretch spanning tree is upper bounded by $T_{\mathrm{tree}}$.

(2) Given $\mathcal{T}$ and the set of edges $\mathcal{F}$ as computed above, let $\mathrm{rload}_{\max} := \max_{e \in \mathcal{T} \setminus \mathcal{F}}\{\mathrm{rload}_{\mathcal{T}}(e)\}$. The
number of edges of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ with relative load at least $\mathrm{rload}_{\max}/2$ is $\Omega(|\mathcal{E}| \alpha / s)$.

As observed above, in $\mathcal{T}$, all edges in the set $\mathcal{F}_i$ have relative load between $\mathrm{rload}_{\max}/2$ and $\mathrm{rload}_{\max}$.
When constructing $\mathcal{H}(\mathcal{T}, \mathcal{F})$, the relative load of edges in $\mathcal{T} \setminus \mathcal{F}$ does not change and thus, all edges in
$\mathcal{F}_i \setminus \mathcal{F} = \mathcal{F}_i \setminus \mathcal{R}$ have relative load between $\mathrm{rload}_{\max}/2$ and $\mathrm{rload}_{\max}$. Recall that $|\mathcal{F}_i| = \Omega(j/\log n)$.
By Lemma 8.2, with high probability, we have $|\mathcal{R}| = O(\sqrt{n})$. Since we assumed that $j = \omega(\sqrt{n} \log n)$,
we have $|\mathcal{F}_i| = \omega(|\mathcal{R}|)$ and thus $|\mathcal{F}_i \setminus \mathcal{R}| = \Omega(j/\log n)$. The second condition is now satisfied by choosing
$s = O(|\mathcal{E}| \alpha \log(n)/j) = 2^{O(\sqrt{\log N \log\log N})} \cdot N/j$.

Assuming that the time to compute a single low average stretch spanning tree can be upper bounded
by $T_{\mathrm{tree}} = (D + \sqrt{n}) \cdot n^{o(1)}$, the lemma now follows. By Theorem 3.1, this is guaranteed as long as
all edge lengths are integers between 1 and $2^{n^{o(1)}}$. Inspecting the construction in [20] and [27], we can
observe that the edge lengths cannot get larger than a value exponential in $\alpha$. By rounding them to integers,
we introduce an additional multiplicative error of factor 2, which does not affect the asymptotic behavior.
As $\alpha = n^{o(1)}$, the claim of the lemma follows. □

8.3 Transforming $\mathcal{H}(\mathcal{T}, \mathcal{F})$ into a j-Tree

Given a graph $\mathcal{H}(\mathcal{T}, \mathcal{F}) \in \mathbb{H}[j]$, it remains to transform $\mathcal{H}(\mathcal{T}, \mathcal{F})$ into an $O(j)$-tree $\mathcal{J}$ such that the two
graphs are $O(1)$-embeddable into each other. In the following, we first describe the construction and we
formally prove that the resulting $O(j)$-tree $\mathcal{J}$ and the given graph $\mathcal{H}(\mathcal{T}, \mathcal{F})$ are $O(1)$-embeddable into each
other. We then show how to efficiently construct $\mathcal{J}$ in a distributed way.

Assume that we are given a spanning tree $\mathcal{T}$ of $\mathcal{G}$ and a graph $\mathcal{H}(\mathcal{T}, \mathcal{F})$ which is constructed as described
above. Consider the forest induced by the edges in $\mathcal{T} \setminus \mathcal{F}$.

Let $P_1 \subseteq \mathcal{V}$ be the set of clusters of $\mathcal{T} \setminus \mathcal{F}$ which are incident to one of the deleted tree edges in $\mathcal{F}$.
We call $P_1$ the primary portals of $\mathcal{T} \setminus \mathcal{F}$. Given $P_1$, we define the skeleton $S_{\mathcal{T} \setminus \mathcal{F}}$ of $\mathcal{T} \setminus \mathcal{F}$ as follows.
$S_{\mathcal{T} \setminus \mathcal{F}}$ is obtained from $\mathcal{T} \setminus \mathcal{F}$ by repeatedly deleting non-portal clusters of degree 1 until all remaining clusters
are either in $P_1$ or they have degree at least 2. Denote by $P_2$ all clusters of degree larger than 2 that are not
primary portals; $P_2$ are the secondary portals. The set of all portal clusters now is $P := P_1 \cup P_2$.
The skeleton $S_{\mathcal{T} \setminus \mathcal{F}}$ is thus a forest consisting of a set of portals and paths connecting them, where all inner
clusters of these paths have degree 2.

Given the skeleton $S_{\mathcal{T} \setminus \mathcal{F}}$, consider one of these paths $\mathcal{P}$. In the last step, we remove the edge with the
smallest capacity from each such $\mathcal{P}$. In doing so, we split the forest into trees so that each tree contains
exactly one portal. It is straightforward to bound the number of resulting trees in terms of $|\mathcal{F}|$.

Lemma 8.5. Let $\mathcal{H}(\mathcal{T}, \mathcal{F}) \in \mathbb{H}[j]$, i.e., $|\mathcal{F}| \le j$. Then, in the above construction, the total number of portal
nodes is less than $4j$.

Proof. Clearly, $|P_1| \le 2|\mathcal{F}| \le 2j$. As when computing the skeleton $S_{\mathcal{T} \setminus \mathcal{F}}$, non-portal clusters of degree 1
are successively removed, we obtain a forest whose leaves are primary portals. As the sum of the degrees
in an $N$-node forest is at most $2(N - 1)$, the number of nodes of degree at least 3 is upper bounded by the
number of leaves minus 2. We conclude that $|P_2| < |P_1| \le 2j$, and hence $|P| < 4j$. □

Finally, we identify each of the resulting trees with its portal and logically move all edges between
different trees to the portals. For each edge $e = \{c, c'\} \in \mathcal{E}_{\mathcal{H}} \setminus (\mathcal{T} \setminus \mathcal{F})$ (i.e., each non-tree edge of $\mathcal{H}(\mathcal{T}, \mathcal{F})$),
we add a virtual edge of capacity $\mathrm{cap}_{\mathcal{G}}(e)$ between the portals of the trees containing $c$ and $c'$, respectively.
Further, let $\mathcal{D}$ be the set of edges that were deleted from the paths of degree-2 clusters connecting portals in
the skeleton. For every edge $e \in \mathcal{D}$, we add a virtual edge of capacity $\mathrm{cap}_{\mathcal{H}}(e) = \mathrm{cap}_{\mathcal{T}}(e)$ between the two
portals that were connected by the path from which $e$ was deleted.

Let us summarize this part of the construction; see Figure 5 for an example of a possible result. Starting
from a forest $\mathcal{T} \setminus \mathcal{F}$, do as follows:

1. Define $P_1$ as the endpoints of edges in $\mathcal{F}$;

2. iteratively delete degree-1 clusters that are not in $P_1$ until this process halts;

3. define $P_2$ as the clusters retaining degree larger than 2 that are not in $P_1$ and set $P := P_1 \cup P_2$;

4. delete from each (maximal) path without clusters from $P$ the edge $e \in \mathcal{D}$ of minimum capacity and
replace it by an edge of the same capacity between the path's endpoints; and

5. for each edge $e \in \mathcal{E}_{\mathcal{H}}$ between different components of $\mathcal{T} \setminus (\mathcal{F} \cup \mathcal{D})$, add an edge of the same capacity
between the unique portals in these components.

Hence, the resulting graph consists of the forest induced by $\mathcal{T} \setminus (\mathcal{F} \cup \mathcal{D})$ and (possibly parallel) edges between
the unique portals of the trees of the forest. By Lemma 8.5, the number of such portals is smaller than $4j$,
implying that the resulting graph is a $4j$-tree. In the following, we denote this $4j$-tree by $\mathcal{J}$.
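Steps 1 to 3 of the summary above amount to leaf pruning followed by a degree check. A minimal centralized sketch, assuming the forest is given as an adjacency dictionary (the function name and encoding are hypothetical), could look as follows; Steps 4 and 5 are not covered.

```python
def skeleton_and_portals(adj, primary):
    """adj: cluster -> set of neighbouring clusters in the forest T \\ F.
    primary: the set P1 of clusters incident to a deleted tree edge.

    Returns (skeleton clusters, secondary portals P2).
    """
    deg = {c: len(nb) for c, nb in adj.items()}
    alive = set(adj)
    # Degree 0 is included so that a component without any portal vanishes.
    queue = [c for c in adj if deg[c] <= 1 and c not in primary]
    while queue:
        c = queue.pop()
        if c not in alive:
            continue
        alive.discard(c)
        for nb in adj[c]:
            if nb in alive:
                deg[nb] -= 1
                if deg[nb] <= 1 and nb not in primary:
                    queue.append(nb)
    # deg now holds the skeleton degree of every surviving cluster.
    secondary = {c for c in alive if deg[c] > 2 and c not in primary}
    return alive, secondary
```

In the test instance, a dangling branch is pruned away and the remaining branching cluster, which joins three primary portals, becomes the only secondary portal.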

Mutual Embeddability of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ and $\mathcal{J}$. Before discussing how to efficiently construct (and represent)
$\mathcal{J}$ in a distributed way, let us first show that $\mathcal{H}(\mathcal{T}, \mathcal{F})$ and $\mathcal{J}$ are $O(1)$-embeddable into each other.

The proofs of the following two lemmas are very similar to the corresponding results by Madry [19].
However, since our $j$-tree construction slightly deviates from Madry’s, the claims do not readily follow
from any lemma in [19, 20].

Lemma 8.6. $\mathcal{H}(\mathcal{T}, \mathcal{F})$ is $O(1)$-embeddable into $\mathcal{J}$.

Proof. There are three types of edges of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ to distinguish:

a) edges in $(\mathcal{T} \setminus \mathcal{F}) \setminus \mathcal{D}$,

b) edges in $\mathcal{D}$, and

c) the remaining edges from $\mathcal{E}_{\mathcal{H}}$ connecting different trees of the forest $\mathcal{T} \setminus \mathcal{F}$.

Case a) is the most straightforward, as all these edges are also present in $\mathcal{J}$ with the same capacity. Edges
$e \in \mathcal{D}$ were deleted from a path $\mathcal{P}$ connecting two portals in the skeleton. In $\mathcal{J}$, they can therefore
be routed through the path $\mathcal{P}$ and the virtual edge with capacity $\mathrm{cap}_{\mathcal{T}}(e)$ connecting the portal nodes at the
ends of $\mathcal{P}$. Because $e$ is the lowest capacity edge of $\mathcal{P}$, this adds relative load at most 1 to each edge of $\mathcal{P}$.
Figure 5: An example j-tree at the second level of recursion. On the left side, green circles indicate the components of
the forest of this j-tree, which are each made of a number of 1-clusters, indicated by blue double-line circular shapes. 
Edges inside 1-clusters are not shown. Solid straight green edges indicate virtual edges of level-1 that became edges 
of the level-2 forest. For each of these green edges, there is a real edge between some two nodes of the same two 
level-1 components (not shown). Brown edges represent (virtual) core edges, and they are mapped (blue arrows) to 
corresponding real edges (solid black). On the right side, real edges related to the virtual forest edges are represented 
as solid black edges between two nodes of the same connected component. 

Finally, let us consider one of the remaining edges $e \in \mathcal{E}_{\mathcal{H}}$. The edge $e = \{c_1, c_2\}$ connects two trees
$T_1 \ne T_2$ of $\mathcal{J}$. Let us assume that $c_1 \in T_1$ and $c_2 \in T_2$. When routing from $c_1$ to $c_2$, we follow

1. the path from $c_1$ in $T_1$ to the first skeleton cluster $s_1 \in T_1$ on the path to the (unique) portal $p_1 \in T_1 \cap P$,

2. the skeleton path from $s_1$ to $p_1$,

3. the virtual edge corresponding to $e$ between $p_1$ and the (unique) portal $p_2 \in T_2 \cap P$,

4. the skeleton path from $p_2$ to the last skeleton cluster $s_2 \in T_2$ when going from $p_2$ to $c_2$ in $T_2$, and

5. the path from $s_2$ to $c_2$ in $T_2$.

Let us compare the path from $c_1$ via $s_1$ to $p_1$ with the path on which $e$ is routed on the spanning tree $\mathcal{T}$. The
part from $c_1$ to $s_1$ is also used when routing in $\mathcal{T}$. If from $s_1$ we follow the same direction to $p_1$ as in $\mathcal{T}$, the
two paths are, in fact, identical up to the point when we reach $p_1$. In this case, the capacities on this path suffice
by construction, as we defined $\mathrm{cap}_{\mathcal{T}}(e') = |f'(e')|$. Let us therefore consider the case in which we set out
in the opposite direction from $s_1$ on the skeleton path $\mathcal{P} \ni s_1$ connecting $p_1$ to some other portal than we
would in $\mathcal{T}$. In that case, routing on $\mathcal{T}$ would cross the edge $e' \in \mathcal{D}$ that was deleted from $\mathcal{P}$. Because $e'$ is
the edge from $\mathcal{P}$ of smallest capacity, the contribution to the relative load of all edges crossed on $\mathcal{P}$ is upper
bounded by the relative load contributed to $e'$ when routing on $\mathcal{T}$. Again, summing over all edges from $\mathcal{E}_{\mathcal{H}}$
falling under Case c), this may increase their total relative loads only by an additive 1. Trivially, the third
step causes relative load 1 on the virtual edge corresponding to $e$, since it is not used for routing any other
edge. Reasoning symmetrically for Steps 4 and 5, we can conclude that embedding $\mathcal{H}(\mathcal{T}, \mathcal{F})$ into $\mathcal{J}$ leads
to constant relative load on all edges. □

Lemma 8.7. $\mathcal{J}$ is $O(1)$-embeddable into $\mathcal{H}(\mathcal{T}, \mathcal{F})$.

Proof. All the edges of the trees of $\mathcal{J}$ are also present in $\mathcal{H}(\mathcal{T}, \mathcal{F})$ with the same capacity; they are hence
straightforward to embed. Let us therefore consider the virtual edges connecting the portals of $\mathcal{J}$. There
are two types of virtual edges, the ones representing edges from $\mathcal{D}$ and those that correspond to the non-tree
edges of $\mathcal{H}(\mathcal{T}, \mathcal{F})$. We first have a look at a virtual edge $e$ corresponding to an edge $e' \in \mathcal{D}$. The edge $e$ is
routed by following the path from which $e'$ was removed. Since the capacity of each edge on the path is at least
the capacity of $e$, this contributes at most 1 to the relative load of each edge.

Now consider a virtual edge $e = \{c, c'\}$ corresponding to an edge $e' \in \mathcal{E}_{\mathcal{H}}$ of $\mathcal{H}(\mathcal{T}, \mathcal{F})$. The edge $e$ is
routed on the trees of $\mathcal{J}$ the clusters $c$ and $c'$ reside in and via $e'$. The latter causes relative load 1, as $e$ and $e'$
have the same capacity and no other edge uses $e'$. Similarly to the embedding of $\mathcal{H}(\mathcal{T}, \mathcal{F})$ into $\mathcal{J}$, the tree
parts of the routing path that are subpaths of the path between $c$ and $c'$ in $\mathcal{T}$ will not cause more
than additive relative load 1 when summing over all edges of this type to embed. If we diverge from this
path, this is because an edge from $\mathcal{D}$ lies on the routing path in $\mathcal{T}$; analogously to Lemma 8.6, following the
skeleton path from which it was deleted to the respective portal increases the maximum relative load by at
most an additional 1. □

Distributed Implementation. Let us now move to the distributed implementation of the above $4j$-tree
construction. Recall that because $\mathcal{F}$ includes the random set of edges $\mathcal{R}$, by Lemma 8.2, all trees in $\mathcal{T} \setminus \mathcal{F}$
have depth $\tilde{O}(\sqrt{n})$. With this in mind, constructing the skeleton is fairly simple.

Lemma 8.8. Given are a spanning tree $\mathcal{T}$ of a distributed cluster graph and the set of tree edges $\mathcal{F}$ as
computed above. We can determine the skeleton $S_{\mathcal{T} \setminus \mathcal{F}}$, the set of portals $P$, and the set of edges $\mathcal{D}$ (i.e.,
for $\{c, c'\}_{uv} \in \mathcal{D}$, $u$ and $v$ will learn this) in time $O(\sqrt{n} \log n)$ in the CONGEST model on the underlying
network graph. In the same time, we can also orient the trees rooted at the portals.

Proof. W.l.o.g., consider a single tree $T$ of the forest $\mathcal{T} \setminus \mathcal{F}$. By Lemma 8.2, the induced tree in $G$ has depth
$\tilde{O}(\sqrt{n})$. Perform the following steps:

• For each edge $e \in \mathcal{F}$, its incident clusters learn* that they are primary portals, i.e., are in $P_1$.

• Iteratively mark non-portal clusters with at most 1 unmarked neighboring cluster until this process stops.
Unmarked clusters are in the skeleton.

• Unmarked non-portal clusters with more than two unmarked neighboring clusters are secondary portals.

• The skeleton paths connecting portals find a minimum capacity edge and add it to $\mathcal{D}$.

• Each tree of $\mathcal{T} \setminus (\mathcal{F} \cup \mathcal{D})$ is rooted at its unique portal, whose identifier is made known to all nodes in
the induced tree in $G$ (together with clusters’ spanning trees).

• These identifiers are exchanged with all neighbors in $G$.

From the gathered information, for each edge $\{c, c'\}_{uv} \in \mathcal{J}$, $u$ and $v$ now can determine its membership
and its capacity in $\mathcal{J}$. Observe that the bound of $\tilde{O}(\sqrt{n})$ on the depth of the spanning trees of $G$ leveraged
for communication in the above construction implies that all the above steps can be completed in $\tilde{O}(\sqrt{n})$
rounds, which completes the proof. □

The trees rooted at the portals now induce the clusters of the new cluster graph. 

Corollary 8.9. Given a graph $\mathcal{H}(\mathcal{T}, \mathcal{F}) \in \mathbb{H}[j/4]$ as computed above on a cluster graph whose clusters’
spanning trees have maximum depth $d$, there is an $\tilde{O}(D + d + \sqrt{n})$-round distributed algorithm to compute

• a cluster graph whose clusters’ spanning trees have depth $d + \tilde{O}(\sqrt{n})$; and

• a $j$-tree $\mathcal{J}$ on this cluster graph, i.e., for each edge $e \in \mathcal{J}$, there is a corresponding graph edge
$\{u, v\} \in E$ whose constituent nodes know that $e \in \mathcal{J}$ as well as $\mathrm{cap}_{\mathcal{J}}(e)$; such that

• $\mathcal{H}(\mathcal{T}, \mathcal{F})$ is 1-embeddable into $\mathcal{J}$ and $\mathcal{J}$ is $O(1)$-embeddable into $\mathcal{H}(\mathcal{T}, \mathcal{F})$; and

• the new clusters are induced by the tree components of $\mathcal{J}$.

Proof. This readily follows from Lemmas 8.2, 8.5, 8.6, 8.7, and 8.8. The only thing left to note is that
clusters can learn the number of nodes they contain by a simple converge- and broadcast operation on their
spanning trees. □

* A cluster for which edges to children are in $\mathcal{F}$ may not “know” about its incident edges in $\mathcal{F}$ as a whole, but determining
whether there is at least one is trivial.
8.4 Sampling from the Recursively Constructed Distribution


We now have all pieces in place to efficiently sample from a distribution similar to Sherman’s in a distributed
fashion. The difference is that Theorem 3.1 and thus Lemma 8.4 merely give $\alpha \in 2^{O(\sqrt{\log n \log\log n})}$,
implying that we must use fewer levels of recursion to ensure that the final approximation guarantee of the
congestion approximator will remain in $n^{o(1)}$.

Theorem 8.10. W.h.p., within $(\sqrt{n} + D)\, n^{o(1)}$ rounds of the CONGEST model, we can sample a tree $T$

from a distribution ofn^~^°^^'^ (virtual) rooted spanning trees on G with the following properties. 

• For any cut of $G$ of capacity $C$, the capacity of the cut in $T$ is at least $C$.

• For any cut of $G$ of capacity $C$, the expected capacity of the cut in $T$ is at most $\alpha C$, where $\alpha \in n^{o(1)}$.

• The distributed representation of $T$ is given by a hierarchy of cluster graphs $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, \mathcal{L}_i, \mathcal{T}_i, \psi_i)$,
$i \in \{0, \ldots, i_0\}$, $i_0 \in o(\log n)$, on network graph $G$, with the following properties.

- The spanning trees of the clusters of $\mathcal{G}_i$ have depth $\tilde{O}(\sqrt{n})$.

- |Vio| = 

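- $\mathcal{G}_i$ is the (rooted) tree resulting from $T$ by contracting the clusters of $\mathcal{G}_i$.

- For $i > 0$, $\mathcal{G}_i$ is also a cluster graph on network graph $\mathcal{G}_{i-1}$.

- For $i > 0$, each cluster $c_i \in \mathcal{V}_i$ of $\mathcal{G}_i$, interpreted as cluster graph on $\mathcal{G}_{i-1}$, contains a unique
portal cluster $p(c_i) \in \mathcal{V}_{i-1}$ of $\mathcal{G}_{i-1}$ that is incident to all edges of $\mathcal{G}_i$ containing $c_i$. That is,
$\mathcal{G}_{i-1}$ is a $|\mathcal{V}_i|$-tree with core $p(\mathcal{V}_i)$.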

Proof. In the following, we will use w.h.p. statements as if they were deterministic; the result then follows 
by taking the union bound over all (polynomially many in n) such statements we use. 

Set $\beta := 2^{\log^{3/4} n}$. To start the recursion, we will use $G$ as cluster graph of itself. Formally,
$\mathcal{G}_0 := (V, E, V, \{(\{v\}, \emptyset)\}_{v \in V}, \mathrm{id})$, where id is the identity function. We perform the following construction until
it terminates:

1. Sparsify $\mathcal{G}_{i-1}$ using Lemma 6.1 for some fixed constant $\varepsilon$, e.g., $\varepsilon = 1/2$. This takes $(\sqrt{n} + D)n^{o(1)}$
rounds. Multiply all edge capacities by $1/(1 - \varepsilon)$ (so $\mathcal{G}_{i-1}$ can be 1-embedded into the sparser graph).

2. If |Vj_i| ^ / logn), set iq := i and stop. This takes 0(D) rounds by communicating over a 

BFS tree of G. 

3. Apply Lemma 8.4 for $j = |\mathcal{V}_{i-1}|/(4\beta)$ to the sparsified cluster graph; by the previous step, this choice
of $j$ is feasible. As $|\mathcal{V}_{i-1}|/j = 4\beta \in n^{o(1)}$, constructing the distribution requires $(\sqrt{n} + D)n^{o(1)}$
rounds in total.

4. Sample a cluster graph from the distribution. This is done in $O(D)$ rounds by letting some node broadcast
$O(\log n)$ random bits over a BFS tree.

5. Apply Corollary 8.9 to extract a $|\mathcal{V}_{i-1}|/\beta$-tree of $\mathcal{G}_{i-1}$. The corollary also yields a cluster graph $\mathcal{G}_i$
(which is also a cluster graph on network graph $\mathcal{G}_{i-1}$) so that each of its clusters $c_i$ contains exactly
one portal cluster $p(c_i)$ of the $|\mathcal{V}_{i-1}|/\beta$-tree on $\mathcal{G}_{i-1}$. This step completes in $\tilde{O}(\sqrt{n} + D)$ rounds: there
are fewer than $\log_\beta \sqrt{n} \ll \log n$ iterations of the overall construction, as $|\mathcal{V}_i| \le |\mathcal{V}_{i-1}|/\beta$, implying
that $d \in \tilde{O}(\sqrt{n})$ for each application of Corollary 8.9.

6. Recurse on $\mathcal{G}_i$, i.e., set $i := i + 1$ and go back to Step 1.

When the above construction halts, we have that $|\mathcal{V}_{i_0-1}| = O(\sqrt{n}\,\beta) = n^{1/2+o(1)}$. Thus, we can make the
(sparsified) cluster graph $\mathcal{G}_{i_0-1}$ known to all nodes in $(\sqrt{n} + D)n^{o(1)}$ rounds via a BFS tree of $G$. We then
continue the construction locally without controlling the size of components, which removes the constraint
on $j$ when applying Lemma 8.8, until the core becomes empty, i.e., we construct a tree.^9 We collapse the
cluster graph hierarchy for all locally performed iterations $i \ge i_0$, which defines the tree $\mathcal{G}_{i_0}$ on clusters $\mathcal{V}_{i_0}$
(this is feasible as each $\mathcal{G}_i$, $i > 0$, is also a cluster graph on network graph $\mathcal{G}_{i-1}$).

^8 Note that the corresponding physical edges in $G$ may still connect to different sub-clusters of $c_i$.

^9 This is essentially Sherman's construction on the small constructed cluster graph.
This completes the description of the algorithm. Summing up the running times of the individual steps
and using that $i_0 \in o(\log n)$, we conclude that the construction takes $(\sqrt{n} + D)n^{o(1)}$ rounds. The
construction also maintained the stated structural properties of the cluster hierarchy. Hence, it remains to show
that (i) we sampled from a distribution of $n^{1+o(1)}$ trees and (ii) the stated cut approximation properties are
satisfied.

Showing these properties now is straightforward. In each step $i > 0$ of the recursion, by Lemma 8.4
we constructed a distribution on $O(|\mathcal{V}_{i-1}|/\beta)$-trees. The total number of recursive steps (including the local
ones) is bounded by $\lceil \log_\beta n \rceil = O(\log^{1/4} n)$, as $|\mathcal{V}_i| \le |\mathcal{V}_{i-1}|/\beta$ for each $i > 0$. On each level of recursion,
we compute a distribution on $2^{O(\sqrt{\log n \log\log n})}\beta$ graphs. Hence, the total number
of virtual trees in the (implicit) distribution of virtual trees from which we sampled is bounded by

$$\left(2^{O(\sqrt{\log n \log\log n})}\,\beta\right)^{\lceil \log_\beta n \rceil} = 2^{O(\sqrt{\log n \log\log n}\,\cdot\,\log^{1/4} n)} \cdot \beta^{\lceil \log_\beta n \rceil} = n^{1+o(1)}.$$


Consider a cut of $G$ of capacity $C$. By the properties of decompositions and the fact that we multiplied
capacities by $1/(1 - \varepsilon)$ whenever we sparsified, $G$ is 1-embeddable into any of the trees we might construct,
implying that the corresponding cut of the sampled tree has capacity at least $C$. As in each step, we (i) apply
a $(1+\varepsilon)$-sparsifier and multiply capacities by $1/(1 - \varepsilon)$ for constant $\varepsilon$, (ii) construct a
decomposition (for some family $\mathcal{H}$) from which we sample, and (iii) transform the resulting graph into a
$j$-tree which can be $O(1)$-embedded into the graph from which it is constructed, we overestimate the capacity
of a given cut by an expected factor of $2^{O(\sqrt{\log n \log\log n})}$ in each step. Using
that this bound is uniform and the randomness on each level of recursion is independent, it follows that the
expected capacity of a cut of $G$ of capacity $C$ in the sampled virtual tree is bounded by

$$\left(2^{O(\sqrt{\log n \log\log n})}\right)^{\lceil \log_\beta n \rceil} \cdot C = 2^{O(\sqrt{\log n \log\log n}\,\cdot\,\log^{1/4} n)} \cdot C = n^{o(1)} \cdot C.$$


□ 
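The final equality in the proof rests on a short exponent computation, spelled out here as a worked check. It assumes $\beta = 2^{\log^{3/4} n}$ and base-2 logarithms; this value of $\beta$ is an assumption chosen to be consistent with the bound of $O(\log^{1/4} n)$ recursion levels used in the proof, and the computation should be adapted if $\beta$ is read differently.

```latex
\begin{align*}
\log_\beta n &= \frac{\log n}{\log \beta} = \frac{\log n}{\log^{3/4} n} = \log^{1/4} n, \\
\sqrt{\log n \,\log\log n}\cdot \log^{1/4} n
  &= \log^{3/4} n \cdot \sqrt{\log\log n}
   = \log n \cdot \sqrt{\frac{\log\log n}{\log^{1/2} n}} = o(\log n), \\
2^{O\left(\sqrt{\log n \,\log\log n}\,\cdot\,\log^{1/4} n\right)}
  &= 2^{o(\log n)} = n^{o(1)}.
\end{align*}
```

Under the same assumption, each level contributes an additional factor $\beta$ to the number of trees, and $\beta^{\lceil \log_\beta n \rceil} \le \beta n = n^{1+o(1)}$, which accounts for the exponent $1 + o(1)$ in the size of the distribution.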


9 The High-Level Algorithm 

The algorithm is a distributed implementation of Sherman's algorithm [30]. It consists of a logarithmic
number of calls to algorithm AlmostRoute, described in Section 9.1, and one computation of a
maximum-weight spanning tree and routing the left-over demand through this tree. Pseudocode for the top-level
algorithm is presented in Algorithm 1.


Algorithm 1 Max Flow. Input: demand vector $b \in \mathbb{R}^n$; output: flow vector $f \in \mathbb{R}^m$.

1: $b_0 \leftarrow b$; $f_0 \leftarrow 0$

2: for $i \leftarrow 1$ to $(\log m + 1)$ do

3:     $f_i \leftarrow$ AlmostRoute$(b_{i-1}, \varepsilon)$

4:     $b_i \leftarrow b_{i-1} - B f_i$

5: Compute a maximum weight spanning tree $T$ on $G$, where weights are the capacities of edges.
6: Route the residual demand $b_t$ through $T$; let $f_T$ be the resulting flow.

7: Output $f_T + \sum_i f_i$


Most of this section is dedicated to explaining how to implement the AlmostRoute algorithm. Let us 
first quickly outline how we implement the final steps using standard techniques. 

Lemma 9.1. Steps 5-6 can be implemented in the CONGEST model in $\tilde{O}(D + \sqrt{n})$ rounds w.h.p.
Proof sketch. A maximum weight spanning tree $T$ can be computed in $\tilde{O}(D + \sqrt{n})$ rounds using the
minimum weight spanning tree algorithm of Kutten and Peleg [18] (say, by assigning weight $w(e) :=
-\mathrm{cap}(e)$ for each edge $e$). To compute the flow, we use the following observation: if $T$ was rooted at one
of its nodes, then to route the demand $b_t$ over $T$, it would be sufficient for each node $v$ to learn the total
demand $d_v$ in the subtree rooted at $v$. In this case each node $v$ assigns $d_v$ units of flow to the edge leading
from $v$ to its parent.
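The weight-negation trick can be illustrated sequentially. The following is a minimal sketch, not the distributed algorithm of Kutten and Peleg [18]: it runs Kruskal's algorithm on weights $w(e) = -\mathrm{cap}(e)$, which is the same as scanning edges by decreasing capacity. The function name and the edge format `(u, v, cap)` are assumptions made for this illustration.

```python
def max_weight_spanning_tree(n, edges):
    """Return the edges of a maximum-capacity spanning forest.

    n     -- number of nodes, labelled 0..n-1
    edges -- list of (u, v, cap) triples (hypothetical input format)

    Runs Kruskal's minimum spanning tree algorithm on w(e) = -cap(e).
    """
    parent = list(range(n))  # union-find forest

    def find(x):
        # Find the representative of x with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Sorting by w(e) = -cap(e) ascending equals sorting by cap descending.
    for u, v, cap in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge connects two different components: keep it
            parent[ru] = rv
            tree.append((u, v, cap))
    return tree
```

On a connected graph the result has $n - 1$ edges; the distributed algorithm produces a tree of the same total capacity but organizes the computation very differently.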

We now show how to root the tree and find the total demand in each subtree in $\tilde{O}(D + \sqrt{n})$ rounds. The
algorithm is as follows. Remove each edge of the tree independently with probability $1/\sqrt{n}$. W.h.p.,

(i) each connected component induced by the remaining edges has strong diameter $\tilde{O}(\sqrt{n})$,

(ii) $O(\sqrt{n})$ edges are removed, and hence
(iii) the number of components is $O(\sqrt{n})$.

Within each component, all demands are summed up, and this sum is made known to all nodes. The
summation takes $\tilde{O}(\sqrt{n})$ rounds due to (i), and we can pipeline the announcement of the sums over a BFS
tree in $O(\sqrt{n} + D)$ rounds due to (iii).

Moreover, in this time we can also assign unique identifiers to the components (e.g. the minimum
identifier) and make the tree resulting from contracting components globally known. Using local computation
only, nodes then can root this tree (e.g. at the cluster of minimum identifier) and determine the sum of the
demands of the clusters that are fully contained in their subtree. Using a simple broadcast, the orientation
of edges within components is determined, and using a convergecast on the components, each node can
determine the sum of demands in its subtree. These steps take another $\tilde{O}(\sqrt{n})$ rounds. □
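The observation underlying the proof, that subtree demand sums determine the tree flow, can be checked with a small sequential simulation. This is a sketch only: it roots the tree by BFS and accumulates sums bottom-up in one process, and it does not model the edge-removal decomposition or the round complexity. The function name and the returned dictionary keyed by `(child, parent)` are assumptions made for this illustration.

```python
from collections import defaultdict


def route_on_tree(tree_edges, demand, root=0):
    """Return {(v, parent_of_v): d_v}, where d_v is the total demand in the
    subtree rooted at v, i.e. the flow v assigns to its parent edge.

    tree_edges -- list of undirected tree edges (u, v)
    demand     -- list of node demands, assumed to sum to zero
    """
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)

    # Root the tree: BFS from the root records parents and a top-down order.
    parent = {root: None}
    order = [root]
    for u in order:  # the list grows while we iterate, which gives BFS order
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)

    # Convergecast: process nodes bottom-up and push subtree sums upward.
    subtree = list(demand)
    flow = {}
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            flow[(v, p)] = subtree[v]
            subtree[p] += subtree[v]
    return flow
```

For the path 0-1-2 rooted at node 0 with demands `[-3, 1, 2]`, node 2 sends 2 units toward node 1, and node 1 sends 3 units toward the root.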

9.1 Algorithm AlmostRoute: The Gradient Descent 

We now explain how to implement Algorithm AlmostRoute in a distributed setting. The idea is to use 
gradient descent with the potential function 

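$$\phi(f) = \mathrm{smax}(C^{-1} f) + \mathrm{smax}(2\alpha R(b - Bf)),$$

where the "soft-max" function, defined by

$$\mathrm{smax}(y) = \log\Big(\sum_{i=1}^{d} \big(e^{y_i} + e^{-y_i}\big)\Big) \quad \text{for all } y \in \mathbb{R}^d,$$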

is used as a differentiable approximation to the “max” function. 
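A small numerical sketch makes the approximation quality concrete. It assumes the soft-max has the form $\mathrm{smax}(y) = \log \sum_i (e^{y_i} + e^{-y_i})$, as in Sherman [30]; the shift by $\max_i |y_i|$ is a standard log-sum-exp stabilization added here and is not part of the paper.

```python
import math


def smax(y):
    """Soft-max log(sum_i (exp(y_i) + exp(-y_i))) of a non-empty sequence y.

    The largest absolute value m is factored out before exponentiating so
    that no term overflows; the result equals the unshifted formula.
    """
    m = max(abs(v) for v in y)
    total = sum(math.exp(v - m) + math.exp(-v - m) for v in y)
    return m + math.log(total)
```

For a vector with $d$ entries the value lies between $\max_i |y_i|$ and $\max_i |y_i| + \log(2d)$, which is the sense in which smax approximates the maximum absolute entry.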

Given this potential function, AlmostRoute performs $O(\alpha^2 \varepsilon^{-3} \log n)$ updates on $f$ and outputs a flow
$f$ that optimizes the potential function up to a $(1 + \varepsilon)$ factor.^10 Pseudocode for this algorithm is given in
Algorithm 2.

^10 Sherman claims that one can save a factor of $1/\varepsilon$ by a more careful scaling [30].
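Algorithm 2 AlmostRoute($b$, $\varepsilon$)

1: $k_b \leftarrow 2\alpha \|Rb\|_\infty \varepsilon/(16 \log n)$; $b \leftarrow k_b b$.

2: repeat

3:     $k_f \leftarrow 1$

4:     while $\phi(f) < 16\varepsilon^{-1} \log n$ do

5:         $f \leftarrow f \cdot (17/16)$; $b \leftarrow b \cdot (17/16)$; $k_f \leftarrow k_f \cdot (17/16)$

6:     $\delta \leftarrow \sum_{e \in E} \left|\mathrm{cap}(e)\, \frac{\partial \phi}{\partial f_e}\right|$

7:     if $\delta \ge \varepsilon/4$ then

8:         $f_e \leftarrow f_e - \mathrm{sgn}\!\left(\frac{\partial \phi}{\partial f_e}\right) \cdot \mathrm{cap}(e) \cdot \frac{\delta}{1 + 4\alpha^2}$

9:     else

10:         $f_e \leftarrow f_e/k_f$ for all edges $e \in E$.

11:         $b_v \leftarrow b_v/(k_b k_f)$ for all nodes $v \in V$

12:         return

13: until done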


To implement this algorithm in a distributed setting, we need to compute $R$, and to do multiplications
by $R$ or its transpose $R^T$, distributively. These multiplications are needed for computing $\phi(f)$ and its
partial derivatives. We remark that $R$ and $R^T$ are not constructed explicitly, as we need to ensure a small time
complexity for each iteration. Assuming that we can perform these operations, each step of AlmostRoute
can be completed in $O(D)$ additional rounds.

We maintain the invariant that at the beginning of each iteration of the repeat loop, each node $v$ knows
the current flow over each of the links $v$ is incident to, and the current demand at $v$ (i.e., $(b - Bf)_v$). Let us
break the potential function $\phi$ in two, i.e.,

$$\phi(f) = \phi_1(f) + \phi_2(f), \quad \text{where } \phi_1(f) = \mathrm{smax}(C^{-1} f) \text{ and } \phi_2(f) = \mathrm{smax}(2\alpha R(b - Bf)).$$

We proceed as follows. First, we compute $\phi_1(f)$: to find $\mathrm{smax}(C^{-1} f)$, it suffices to sum $\exp(f_e/\mathrm{cap}(e))$
and $\exp(-f_e/\mathrm{cap}(e))$ over all edges $e$, which can be done in $O(D)$ rounds. As Sherman points out,
$\phi(f) = O(\varepsilon^{-1} \log n)$ due to the scaling, and thus, encoding $\exp(\phi(f))$ with sufficient accuracy requires
$O(\varepsilon^{-1} \log n)$ bits, which is thereby also a bound on the encoding length of all individual terms in the sums
for $\phi_1$ and $\phi_2$. The error introduced by rounding these values to integers is small enough to not affect the
asymptotics of the running time.

For determining $\phi_2(f)$, we compute the vector $y := 2\alpha R(b - Bf)$ and then do an aggregation on a
BFS tree as for $\phi_1(f)$. Since $Bf$ can be computed instantly ($(Bf)_v$ is exactly the net flow into $v$), this boils
down to multiplying a locally known vector with $R$. Before we discuss how to implement this operation, let us
explain more about the structure of $R$ and how we determine $\frac{\partial \phi}{\partial f_e}$, which is required in Lines 6 and 8 of the
algorithm.

The linear operator R is induced by graph cuts. More precisely, in the matrix representation of R, there 
is one row for each cut our congestion approximator (explicitly) considers. We will clarify the structure of 
R shortly; for now, denote by I the set of row indices of R. Observe that 

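$$\frac{\partial \phi}{\partial f_e} = \frac{\exp(f_e/\mathrm{cap}(e)) - \exp(-f_e/\mathrm{cap}(e))}{\mathrm{cap}(e)\exp(\phi_1)} + \frac{\partial \phi_2}{\partial f_e}$$

and hence, given that $\phi_1$ is known, the first term is locally computable. The second term expands to

$$\frac{\partial \phi_2}{\partial f_e} = \sum_{i \in I} \frac{\partial \phi_2}{\partial y_i} \frac{\partial y_i}{\partial f_e} = \sum_{i \in I} -\frac{\exp(y_i) - \exp(-y_i)}{\exp(\phi_2)} \cdot \frac{2\alpha B_{i,e}}{\mathrm{cap}(i)},$$

where $\mathrm{cap}(i)$ is the capacity of cut $i$ in the congestion approximator and $B_{i,e} \in \{-1, 0, 1\}$ denotes whether
$e$ is outgoing ($-1$), ingoing ($1$), or not crossing cut $i$.^11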
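The locally computable first term can be checked numerically. The sketch below assumes $\phi_1(f) = \log \sum_e (\exp(f_e/\mathrm{cap}(e)) + \exp(-f_e/\mathrm{cap}(e)))$, so that each edge only needs its own flow, its own capacity, and the globally aggregated value $\exp(\phi_1)$. The function names are hypothetical, and no overflow protection is included since the inputs are meant to be small.

```python
import math


def phi1(f, cap):
    """phi_1(f) = smax(C^{-1} f) for per-edge flows f and capacities cap."""
    return math.log(sum(math.exp(fe / c) + math.exp(-fe / c)
                        for fe, c in zip(f, cap)))


def grad_phi1(f, cap):
    """Partial derivatives of phi_1 with respect to each f_e.

    Entry e equals (exp(f_e/cap_e) - exp(-f_e/cap_e)) / (cap_e * exp(phi_1)),
    i.e. it depends only on edge-local data plus the aggregate exp(phi_1).
    """
    z = math.exp(phi1(f, cap))  # the one globally aggregated quantity
    return [(math.exp(fe / c) - math.exp(-fe / c)) / (c * z)
            for fe, c in zip(f, cap)]
```

Comparing `grad_phi1` against central finite differences of `phi1` is a quick way to confirm the sign and the $1/\mathrm{cap}(e)$ factor.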

The cuts $i \in I$ are induced by the edges of a collection of (rooted, virtual, capacitated) spanning trees
$\mathcal{T}$, where for $T \in \mathcal{T}$ we write $(\hat{v}, v) \in T$ if $\hat{v}$ is the parent of $v$ and denote by $T_v$ the subtree rooted at $v$. For
each $T \in \mathcal{T}$, each edge $(\hat{v}, v) \in T$ now induces a (directed) cut $(T_v, \overline{T_v})$ with index $i(T, (\hat{v}, v))$. We denote
the set of edges crossing this cut by $\delta(T_v)$. Let us also define

$$p(T, (\hat{v}, v)) := -\frac{\exp(y_{i(T,(\hat{v},v))}) - \exp(-y_{i(T,(\hat{v},v))})}{\exp(\phi_2)} \cdot \frac{2\alpha}{\mathrm{cap}_T((\hat{v}, v))}.$$

With this notation, we have that

$$\frac{\partial \phi_2}{\partial f_e} = \sum_{T \in \mathcal{T}} \sum_{\substack{(\hat{v}, v) \in T \\ e \in \delta(T_v)}} p(T, (\hat{v}, v)) \cdot B_{i(T,(\hat{v},v)),e}.$$

We call $p(T, (\hat{v}, v))$ the price of the (virtual) edge $(\hat{v}, v) \in T$. Let $P_{v,T}$ denote the edge set of the unique
path in $T$ from $v$ to the root of $T$. We define a node potential for each node $v$ by

$$\pi_v := \sum_{T \in \mathcal{T}} \sum_{(\hat{w}, w) \in P_{v,T}} p(T, (\hat{w}, w)).$$

For any $e = (u, v)$, the cuts induced by edges in $T \in \mathcal{T}$ that $e$ crosses correspond to the edges on the unique
path from $u$ to $v$ in $T$. For all edges $(\hat{w}, w) \in T$ on the path from $u$ to the least common ancestor of $u$ and
$v$ in $T$, $B_{i(T,(\hat{w},w)),e} = -1$, while $B_{i(T,(\hat{w},w)),e} = +1$ for the edges on the path between $v$ and this least
common ancestor. Thus,

$$\frac{\partial \phi_2}{\partial f_e} = \pi_v - \pi_u, \qquad (4)$$

and our task boils down to determining the value of the potential $\pi_v$ at each node $v \in V$. To this end, we
need two key subroutines to compute distributively the following quantities.


(1) $y_i$ for each cut $i$. Note that $b - Bf$ is known distributedly, i.e., each node knows its own coordinate of
this vector. For each tree $T \in \mathcal{T}$, we need to aggregate this information from the leaves to the root.
This means to simulate a convergecast on the virtual tree $T$.

(2) $\pi_v$ for each node $v$. Provided that each (virtual) tree edge knows its $y$-value and $\phi_2$, the prices can
be computed locally. Then the contribution of each tree to the node potentials can be computed by a
downcast from the corresponding root to its leaves.


With these routines, one iteration of the repeat loop is now executed as follows:

1. Compute $\phi_1$, $y$ (local knowledge), and $\phi_2$ (aggregation on BFS tree once $y$ is known).

2. Check the condition in Line 4. If it holds, locally update $b$, $f$, and $k_f$, and go to the previous step.

3. Compute the potential $\pi$ (local knowledge).

4. For each $e \in E$, its incident nodes determine $\frac{\partial \phi}{\partial f_e}$ (based on Equations 3 and 4, it suffices to exchange
$\pi_u$ and $\pi_v$ over $e$).

5. Compute $\delta$ (aggregation on BFS tree).

6. Locally update $f_e$ and $b_v$ for all $e \in E$ and $v \in V$.
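The two key subroutines can be simulated sequentially on a single rooted tree to confirm that potential differences reproduce the gradient of $\phi_2$. The sketch below makes several assumptions that are not from the paper: there is only one virtual tree, nodes are numbered so that the root is 0 and `parent[v] < v`, graph edges are directed pairs `(u, v)`, and $(Bf)_v$ is the net flow into $v$. With a collection $\mathcal{T}$ of trees, the soft-max would range over the cuts of all trees and the potentials would be summed over trees.

```python
import math


def node_potentials(parent, cap_t, edges, b, f, alpha):
    """Return (phi_2, pi) for one rooted virtual tree.

    parent[v] -- parent of v in the tree (None for the root, node 0);
                 assumes parent[v] < v for every non-root v
    cap_t[v]  -- capacity of the tree edge (parent[v], v); cap_t[0] unused
    edges     -- directed graph edges (u, v); f[k] is the flow on edges[k]
    b         -- demand per node
    """
    n = len(parent)
    # Residual demand r = b - Bf, where (Bf)_v is the net flow into v.
    r = list(b)
    for (u, v), fe in zip(edges, f):
        r[v] -= fe
        r[u] += fe
    # Subroutine (1), convergecast: s[v] = sum of r over the subtree T_v.
    s = list(r)
    for v in range(n - 1, 0, -1):
        s[parent[v]] += s[v]
    y = {v: 2 * alpha * s[v] / cap_t[v] for v in range(1, n)}
    phi2 = math.log(sum(math.exp(t) + math.exp(-t) for t in y.values()))
    # Prices of the tree edges, computed from y and phi_2 alone.
    price = {v: -(math.exp(y[v]) - math.exp(-y[v])) / math.exp(phi2)
                * 2 * alpha / cap_t[v] for v in y}
    # Subroutine (2), downcast: pi[v] = sum of prices on the path to the root.
    pi = [0.0] * n
    for v in range(1, n):
        pi[v] = pi[parent[v]] + price[v]
    return phi2, pi


def grad_phi2(edges, pi):
    """Equation (4): the derivative for edge (u, v) is pi[v] - pi[u]."""
    return [pi[v] - pi[u] for u, v in edges]
```

Once the potentials are known, the two endpoints of an edge only need to exchange $\pi_u$ and $\pi_v$ to obtain the derivative, which is what Step 4 relies on.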

Note that all of the individual operations except for computation of $y$ and $\pi$ can be completed in $O(D)$
rounds. Sherman proved [30] that AlmostRoute terminates after $\tilde{O}(\varepsilon^{-3}\alpha^2)$ iterations. As it is only called
$O(\log n)$ times by the max-flow algorithm, Theorem 1.1 follows if we can compute $y$ and $\pi$ in $(\sqrt{n} +
D)n^{o(1)}$ rounds for an $\alpha$-congestion approximator with $\alpha = n^{o(1)}$; this is subject of the next subsection.

^11 Technically, $B_{i,e} = \sum_{v \in S_i} B_{v,e}$, where $S_i$ is the set of nodes defining cut $i$.
Figure 6: Hierarchical cluster decomposition of a virtual tree $T \in \mathcal{T}$. Black edges are virtual tree edges, which are
represented by a physical edge connecting the top-level clusters they connect (the orange dotted edge $p(e)$ corresponds
to the edge labeled $e$). Each cluster is spanned by a tree in $G$ of depth $\tilde{O}(\sqrt{n})$, which is not shown.

9.2 Congestion Approximation 

Our congestion approximator $R$ is defined by the edge-induced cuts of a sample $\mathcal{T}$ of virtual trees $T$ from a
recursively constructed distribution. The trees are represented distributedly by a hierarchy of cluster graphs
(see Figure 6 for an illustration and Section 5 for the formal definition of cluster graphs). Intuitively, a 
cluster graph partitions the nodes into clusters, each of which has a spanning tree rooted at a leader, and a 
collection of edges between clusters that are represented by corresponding graph edges between some nodes 
of the clusters they connect. In Section 8, we have shown the following theorem. 

Theorem 8.10. (restated) W.h.p., within $(\sqrt{n} + D)n^{o(1)}$ rounds of the CONGEST model, we can sample
a tree $T$ from a distribution of $n^{1+o(1)}$ (virtual) rooted spanning trees on $G$ with the following properties.

• For any cut of $G$ of capacity $C$, the capacity of the cut in $T$ is at least $C$.

• For any cut of $G$ of capacity $C$, the expected capacity of the cut in $T$ is at most $\alpha C$, where $\alpha \in n^{o(1)}$.

• The distributed representation of $T$ is given by a hierarchy of cluster graphs $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, \mathcal{C}_i, \mathcal{T}_i, \psi_i)$,
$i \in \{0, \ldots, i_0\}$, $i_0 \in o(\log n)$, on network graph $G$, with the following properties.

- The spanning trees of the clusters of $\mathcal{G}_i$ have depth $\tilde{O}(\sqrt{n})$.

- I Viol = 

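- $\mathcal{G}_{i_0}$ is the (rooted) tree resulting from $T$ by contracting the clusters of $\mathcal{G}_{i_0}$.

- For $i > 0$, $\mathcal{G}_i$ is also a cluster graph on network graph $\mathcal{G}_{i-1}$.

- For $i > 0$, each cluster $c_i \in \mathcal{V}_i$ of $\mathcal{G}_i$, interpreted as cluster graph on $\mathcal{G}_{i-1}$, contains a unique
portal cluster $p(c_i) \in \mathcal{V}_{i-1}$ of $\mathcal{G}_{i-1}$ that is incident^12 to all edges of $\mathcal{G}_i$ containing $c_i$. That is,
$\mathcal{G}_{i-1}$ is a $|\mathcal{V}_i|$-tree with core $p(\mathcal{V}_i)$.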

The first two properties of each $T$ stated in the theorem imply that we can use them to construct a good
congestion approximator $R$. More precisely, Lemma 3.3 implies the following corollary.


^12 Note that the corresponding physical edges in $G$ may still connect to different sub-clusters of $c_i$.
Corollary 9.2. Sampling a collection $\mathcal{T}$ of $O(\log n)$ virtual trees given by Theorem 3.2 and using them as
congestion approximator $R$ in the way specified in Section 9.1 implies that the total number of iterations of
Algorithm 2 is $\varepsilon^{-3} n^{o(1)}$.

All that remains now is to show that the distributed representation of each sampled $T \in \mathcal{T}$ allows to
simulate a convergecast and a downcast on $T$ in $(\sqrt{n} + D)n^{o(1)}$ rounds: then we can implement the key
subroutines (1) and (2) (i.e., compute $y$ and $\pi$) outlined in Section 9.1 with this time complexity, and by
Corollary 9.2 the total number of rounds of the computation is bounded by $(\sqrt{n} + D)n^{o(1)}$.

Fortunately, the recursive structure of the decomposition is very specific. The cluster graphs of the
different levels of recursion are nested, i.e., the clusters of the $(i - 1)$-th level of recursion are subdivisions
of the clusters of the $i$-th level. What is more, each cluster is a subtree of the virtual tree and is spanned
by a tree of depth $\tilde{O}(\sqrt{n})$ in $G$ (cf. Figure 6). Hence, while the physical graph edges representing the
virtual tree edges are between arbitrary nodes within the clusters they connect, we can (i) identify each
cluster on each hierarchy level with the root of the subtree induced by its nodes, (ii) handle such subtrees
recursively (both for convergecasts and downcasts), (iii) on each level of recursion but the last, perform the
relevant communication by broadcasting or upcasting on the underlying cluster spanning trees in $G$ of depth
$\tilde{O}(\sqrt{n})$, and (iv) communicate over a BFS tree of $G$ on the final level of recursion, where merely $n^{1/2+o(1)}$
clusters/nodes of the virtual tree remain.

Corollary 9.3. On each virtual tree $T \in \mathcal{T}$, we can simulate convergecast and downcast operations in
$\tilde{O}(\sqrt{n} + D)$ rounds.

Theorem 1.1 now follows from Sherman's results on the number of iterations of the gradient descent
algorithm [30], the discussion in Section 9.1, and Corollaries 9.2 and 9.3.